Hybrid interior-point alternating directions algorithm for support vector machines and feature selection

ABSTRACT

A method for training a classifier for selecting features in sparse data sets with high feature dimensionality includes providing a set of data items x and labels y, minimizing a functional of the data items x and associated labels y 
               L   ⁡     (     w   ,   b   ,   a   ,   c   ,     γ   1     ,     γ   2       )       :=         1   N     ⁢       ∑     i   =   1     N     ⁢     a   i         +       λ   1     ⁢          c        1       +         λ   2     2     ⁢          w        2   2       +       γ   1   T     ⁡     (     e   -     Y   ⁡     (     Xw   +   be     )       -   a     )       +       γ   2   T     ⁡     (     w   -   c     )       +         μ   1     2     ⁢            e   -     Y   ⁡     (     Xw   +   be     )       -   a          2   2       +         μ   2     2     ⁢            w   -   c          2   2               
to solve for hyperplane w and offset b of a classifier by successively iteratively approximating w and b, auxiliary variables a and c, and multiplier vectors γ 1  and γ 2 , wherein λ 1 , λ 2 , μ 1 , and μ 2  are predetermined constants, e is a unit vector, and X and Y are respective matrix representations of the data items x and labels y; providing non-zero elements of the hyperplane vector w and corresponding components of X and Y as arguments to an interior point method solver to solve for hyperplane vector w and offset b, wherein w and b define a classifier than can associate each data item x with the correct label y.

CROSS REFERENCE TO RELATED UNITED STATES APPLICATIONS

This application claims priority from “HIPAD—A fast Hybrid Interior-Point Alternating Directions algorithm for SVM and feature selection”, U.S. Provisional Application No. 61/536,092 of Qin, et al., filed Sep. 19, 2011, the contents of which are herein incorporated by reference in their entirety.

TECHNICAL FIELD

This disclosure is directed to methods for classifying data sets with large feature dimensions in machine learning applications.

DISCUSSION OF THE RELATED ART

Classification tasks on data sets with large feature dimensions are very common in real-world machine learning applications. Typical examples are microarray data for gene selection and text documents for topic classification. Despite the large number of features present in these data sets, usually only small subsets of the features are relevant for the particular learning task, and local correlations among the features are often observed. Hence, feature selection is required for good model interpretability. However, when data is scarce, it is crucial that little prediction accuracy be lost due to errors incurred during optimization during training. Data scarcity is common for biomedical applications, e.g. fMRI data and gene expression data for specific tumors. Another typical case is training classifiers at low levels of a class hierarchy in a hierarchical classification setting, where the training set is small because the labels are very specific. In addition, whenever prior domain knowledge is available, it should be used as much as possible to improve prediction accuracy.

The above considerations suggest three desired features for a learning algorithm: (1) The algorithm should be able to automatically select features/variables which are possibly grouped and highly correlated. (2) The algorithm should be able to efficiently optimize in the training phase with high accuracy. (3) The learning model needs to be flexible enough so that domain knowledge can be easily incorporated. Existing methods are known that individually satisfy the above requirements individually. To handle local correlation among groups of features, the elastic-net regularization has been successfully applied to support vector machines (ENSVM), however, incorporating extra domain knowledge is challenging. An alternating direction method of multipliers (ADMM) has been proposed for the elastic-net SVM which can quickly find an approximate solution to the ENSVM, but ADMM is known to be slow to reach high accuracy. The interior-point methods (IPM) for SVM are known to be able to achieve high accuracy in their solutions with a polynomial iteration complexity, and the dual SVM formulation is independent of the feature space dimensionality. However, because of the need to solve a Newton system in each iteration, the efficiency of IPM quickly deteriorates with increasing feature dimension size.

SUMMARY

Exemplary embodiments of the invention as described herein generally are directed to classification tasks in the regime of scarce labeled training data in high dimensional feature space, where specific expert knowledge is also available. A hybrid learning algorithm according to an embodiment of the invention solves an elastic-net support vector machine (ENSVM) through an alternating direction method of multipliers (ADMM) in a first phase, followed by an interior-point method (IPM) for a standard SVM in a second phase. An algorithm according to an embodiment of the invention addresses the challenges of automatic feature selection, high optimization accuracy, and algorithmic flexibility for knowledge incorporation. Embodiments of the invention have been compared with existing methods on a collection of synthetic and real-world data.

According to an aspect of the invention, there is provided a method for training a classifier for selecting features in sparse data sets with high feature dimensionality, including providing a set of N data items x and associated labels y, each data item being a vector in R^(m) where m>>1, minimizing a functional of the data items x and associated labels y

${L\left( {w,b,a,c,\gamma_{1},\gamma_{2}} \right)}:={{\frac{1}{N}{\sum\limits_{i = 1}^{N}a_{i}}} + {\lambda_{1}{c}_{1}} + {\frac{\lambda_{2}}{2}{w}_{2}^{2}} + {\gamma_{1}^{T}\left( {e - {Y\left( {{Xw} + {be}} \right)} - a} \right)} + {\gamma_{2}^{T}\left( {w - c} \right)} + {\frac{\mu_{1}}{2}{{e - {Y\left( {{Xw} + {be}} \right)} - a}}_{2}^{2}} + {\frac{\mu_{2}}{2}{{w - c}}_{2}^{2}}}$ formed from a program

${\min\limits_{w,b,a,c}{\frac{1}{N}{\sum\limits_{i = 1}^{N}a_{i}}}} + {\lambda_{1}{c}_{1}} + {\frac{\lambda_{2}}{2}{w}_{2}^{2}}$ subject to

$\left\{ \begin{matrix} {{a = {e - {Y\left( {{Xw} + {be}} \right)}}},} \\ {c = {w.}} \end{matrix}\quad \right.$ to solve for hyperplane w and offset b of a classifier using an alternating direction method of multipliers that successively iteratively approximates w and b, auxiliary variables a and c, and multiplier vectors γ₁ and γ₂, wherein λ₁, λ₂, μ₁, and μ₂ are predetermined constants, e is a unit vector, and X and Y are respective matrix representations of the data items x and their associated labels y, and providing non-zero elements of the hyperplane vector w and those components of X and Y that correspond to the non-zero elements of the hyperplane vector w as arguments to a convex quadratic program solver that uses an interior point method to solve for hyperplane vector w and offset b, wherein the hyperplane vector w and offset b define a classifier than can associate each data item x with the correct label y.

According to a further aspect of the invention, the alternating direction method of multipliers includes, at each iteration k, updating (w^(k), b^(k)) by solving

${{\begin{pmatrix} {{\left( {\lambda_{2} + \mu_{2}} \right)I} + {\mu_{1}X^{T}X}} & {\mu_{1}X^{T}e} \\ {\mu_{1}e^{T}} & {\mu_{1}N} \end{pmatrix}\begin{pmatrix} w^{k + 1} \\ b^{k + 1} \end{pmatrix}} = \begin{pmatrix} {{X^{T}Y\;\gamma_{1}^{k}} - {\mu_{1}X^{T}{Y\left( {a^{k} - e} \right)}} - \gamma_{2}^{k} + {\mu_{2}c^{k}}} \\ {{e^{T}Y\;\gamma_{1}^{k}} - {\mu_{1}e^{T}{Y\left( {a_{k} - e} \right)}}} \end{pmatrix}},$ updating a from

$\left. a^{k + 1}\leftarrow{S_{{1/N}\;\mu_{1}}\left( {e + \frac{\gamma_{1}^{k}}{\mu_{1}} - {Y\left( {{Xw}^{k + 1} + {b^{k + 1}e}} \right)}} \right)} \right.,$ wherein

${S_{\lambda}(\omega)} = \left\{ \begin{matrix} {{\omega - \lambda},} & {{\omega > \lambda},} \\ {0,} & {{0 \leq \omega \leq \lambda},} \\ {\omega,} & {{\omega < 0};} \end{matrix} \right.$ updating c from

$\left. c^{k + 1}\leftarrow{T_{\lambda_{1}/\mu_{2}}\left( {\frac{\gamma_{2}^{k}}{\mu_{2}} + w^{k + 1}} \right)} \right.$ wherein T_(λ)(ω)=sgn(ω)max {0,|ω|−λ}, updating γ₁ from γ₁ ^(k+1)←γ₁ ^(k)+μ₁(e−Y(Xw^(k+1)+b^(k+1)e)−a^(k+1)), and updating γ2 from γ₂ ^(k+1)←γ₂ ^(k)+μ₂(w^(k+1)−c^(k+1)), where said updates are repeated until

${\frac{{w^{k + 1} - w^{k}}}{\max\left( {{w^{k}},1} \right)} < ɛ_{tol}},$ wherein ε_(tol) is a predetermined constant.

According to a further aspect of the invention, the linear system

${\begin{pmatrix} {{\left( {\lambda_{2} + \mu_{2}} \right)I} + {\mu_{1}X^{T}X}} & {\mu_{1}X^{T}e} \\ {\mu_{1}e^{T}} & {\mu_{1}N} \end{pmatrix}\begin{pmatrix} w^{k + 1} \\ b^{k + 1} \end{pmatrix}} = \begin{pmatrix} {{X^{T}Y\;\gamma_{1}^{k}} - {\mu_{1}X^{T}{Y\left( {a^{k} - e} \right)}} - \gamma_{2}^{k} + {\mu_{2}c^{k}}} \\ {{e^{T}Y\;\gamma_{1}^{k}} - {\mu_{1}e^{T}{Y\left( {a_{k} - e} \right)}}} \end{pmatrix}$ is solved using a preconditioned conjugate gradient method.

According to a further aspect of the invention, the convex quadratic program solver minimizes a quadratic program ½w^(T)w+ce^(T)ξ with respect to w, b, s, and ξ subject to

$\left\{ {\begin{matrix} {{{{y_{i}\left( {{w^{T}x_{i}} - b} \right)} - s_{i} + \xi_{i}} = 1},} & {{i = 1},\cdots\mspace{14mu},N,} \\ {{s \geq 0},{\xi \geq 0},} & \; \end{matrix},} \right.$ wherein c is a predetermined penalty parameter for ξ, and y_(i) and x_(i) are components of X, Y corresponding to non-zero elements of w.

According to a further aspect of the invention, the convex quadratic program solver minimizes a quadratic program ½α^(T)Qα−e^(T)α that is a dual of ½w^(T)w+ce^(T)ξ, wherein ½α^(T)Qα−e^(T)α is minimized with respect to α subject to

$\left\{ \begin{matrix} {{{y^{T}\alpha} = 0},} & \; \\ {{0 \leq \alpha_{i} \leq c},} & {{i = 1},\ldots\mspace{14mu},N,} \end{matrix}\quad \right.$ and c is a predetermined penalty parameter for ξ.

According to another aspect of the invention, there is provided a method for training a classifier for selecting features in sparse data sets with high feature dimensionality that incorporates prior domain knowledge, including providing a set of data items x and associated labels y, each data item being a vector in R^(m) where m>>1, a set of positive class constraints Bx≦d

w^(T)x+b≧1, wherein BεR^(k) ¹ ^(×m) and dεR^(k) ¹ , and a set of negative class constraints Dx≦g

w^(T)x+b≦−1, wherein DεR^(k) ² ^(×m) and gεR^(k) ² , using Farkas lemma to transform the positive class constraints into B^(T)u+w=0, d^(T)u−b+1≦0, u≧0 and negative class constraints into D^(T)v−w=0, g^(T)v+b+1≦0, v≧0, minimizing a functional

$L = {{\frac{\lambda_{2}}{2}{w}_{2}^{2}} + {\lambda_{1}{c}_{1}} + {\frac{1}{N}{{eT}(a)}_{+}} + {\frac{\rho_{1}}{2}{{{B^{T}u} + w}}_{2}^{2}} + {\rho_{2}(q)}_{+} + {\frac{\rho_{3}}{2}{{{D^{T}v} - w}}_{2}^{2}} + {\rho_{4}(p)}_{+} + {\gamma_{1}^{T}\left( {e - \left( {{\overset{\_}{X}w} + {yb}} \right) - a} \right)} + {\frac{\mu_{1}}{2}{{e - \left( {\overset{\_}{X} + {yb}} \right) - a}}_{2}^{2}} + {\gamma_{2}^{T}\left( {w - c} \right)} + {\frac{\mu_{2}}{2}{{w - c}}_{2}^{2}} + {\gamma_{3}\left( {{d^{T}u} - b + 1 - q} \right)} + {\frac{\mu_{3}}{2}{{{d^{T}u} - b + 1 - q}}_{2}^{2}} + {\gamma_{4}\left( {{g^{T}v} + b + 1 - p} \right)} + {\frac{\mu_{4}}{2}{{{g^{T}v} + b + 1 - p}}_{2}^{2}}}$ to solve for hyperplane w and offset b of a classifier using an alternating direction method of multipliers that successively iteratively approximates w and b, constraint variables u and v, auxiliary variables a, c, p and q, and Lagrangian multipliers γ₁ ^(k+1), γ₂ ^(k+1), γ₃ ^(k+1), γ₄ ^(k+1), γ₅ ^(k+1), and γ₆ ^(k+1) wherein λ₁, λ₂, ρ₁, ρ₂, ρ₃, ρ₄, μ₁, μ₂, μ₃, and μ₄ are predetermined constants, e is a unit vector, and X and Y are respective matrix representations of the data items x and their associated labels y, and providing non-zero elements of the hyperplane vector w, those components of X and Y that correspond to the non-zero elements of the hyperplane vector w, and η_(u) ⁰≡d^(T)u−b+1,η_(v) ⁰≡g^(T)v+b+1 as arguments to a convex quadratic program solver that uses an interior point method to solve for hyperplane vector w and offset b, wherein the hyperplane vector w and offset b define a classifier than can associate each data item x with the correct label y.

According to a further aspect of the invention, updating (w^(k), b^(k)) comprises solving

${\begin{pmatrix} {{\kappa_{1}I} + {\mu_{1}X^{T}X}} & {\mu_{1}X^{T}e} \\ {\mu_{1}{\mathbb{e}}^{T}X} & {{\mu_{1}N} + \kappa_{2}} \end{pmatrix}\begin{pmatrix} w^{k + 1} \\ b^{k + 1} \end{pmatrix}} = \begin{pmatrix} r_{w} \\ r_{b} \end{pmatrix}$ where κ₁=λ₂+μ₂+ρ₁+ρ₃, κ₂=μ₃+μ₄, r _(w) =X ^(T) Yγ ₁ ^(k)+μ₁ X ^(T) Y(e−a ^(k))−γ₂ ^(k)+μ₂ c ^(k)+ρ₃ D ^(T) v ^(k)−ρ₁ B ^(T) u ^(k), and r _(b) =e ^(T) Yγ ₁ ^(k)+μ₁ e ^(T) Y(e−a ^(k))+γ₃ ^(k)+μ₃(d ^(T) u ^(k)+1−q ^(k))−γ₄ ^(k)−μ₄(g ^(T) v ^(k)+1−p ^(k)).

According to a further aspect of the invention,

${\begin{pmatrix} {{\kappa_{1}I} + {\mu_{1}X^{T}X}} & {\mu_{1}X^{T}e} \\ {\mu_{1}{\mathbb{e}}^{T}X} & {{\mu_{1}N} + \kappa_{2}} \end{pmatrix}\begin{pmatrix} w^{k + 1} \\ b^{k + 1} \end{pmatrix}} = \begin{pmatrix} r_{w} \\ r_{b} \end{pmatrix}$ is solved using a preconditioned conjugate gradient method.

According to a further aspect of the invention, updating constraint variables u and v comprises solving (ρ₁BB^(T)+μ₃dd^(T)+μ₅I)u^(k+1/2)=r_(u), wherein r_(u)−ρ₁Bw^(k+)1+μ₃db^(k+1)+μ₃d(q^(k)−1)−dγ₃ ^(k)+γ₅+μ₅s^(k), and (ρ₃DD^(T)+μ₄gg^(T)μ₆I)v^(k+1)=r_(v), where r_(v)=ρ₃Dw^(k+1)−μ₄gb^(k+1)−gγ₄ ^(k)−μ₄g(1−p^(k))+γ₆+μ₆t^(k), and slack variables s and t are updated according to

$s^{k + 1} = {\max\left( {0,{u^{k + 1} - \frac{\gamma_{5}^{k}}{\mu_{5}}}} \right)}$ and $t^{k + 1} = {{\max\left( {0,{v^{k + 1} - \frac{\gamma_{6}^{k}}{\mu_{6}}}} \right)}.}$

According to a further aspect of the invention, the auxiliary variables a, c, p and q are updated according to

${a^{k + 1} = {S_{\frac{1}{N_{{\mu\;}_{1}}}}\left( {e + \frac{\gamma_{1}^{k}}{\mu_{1}} - {Y\left( {{Xw}^{k + 1} + {b^{k + 1}e}} \right)}} \right)}},{c^{k + 1} = {T_{\frac{\lambda_{1}}{\mu_{2}}}\left( {\frac{\gamma_{2}^{k}}{\mu_{2}} + w^{k + 1}} \right)}},{q^{k + 1} = {S_{\frac{\rho\; 2}{\mu_{3}}}\left( {{d^{T}u^{k}} - b^{k + 1} + 1 + \frac{\gamma_{3}^{k}}{\mu_{3}}} \right)}},{p^{k + 1} = {S_{\frac{\rho\; 4}{\mu_{4}}}\left( {{g^{T}v^{k}} + b^{k + 1} + 1 + \frac{\gamma_{4}^{k}}{\mu_{4}}} \right)}},$ wherein

${S_{\lambda}(\omega)} = \left\{ \begin{matrix} {{\omega - \lambda},} & {{\omega > \lambda},} \\ {0,} & {{0 \leq \omega \leq \lambda},} \\ {\omega,} & {{\omega < 0};} \end{matrix} \right.$ and T_(λ)(ω)=sgn(ω)max {0,|ω|−λ}.

According to a further aspect of the invention, the Lagrangian multipliers γ₁ ^(k+1), γ₂ ^(k+1), γ₃ ^(k+1), γ₄ ^(k+1), γ₅ ^(k+1), and γ₆ ^(k+1) are updated according to γ₁ ^(K+1)←γ₁ ^(K)+μ₁(e−Y(Xw^(k+1)+b^(k+1)e)−a^(k+1)), γ₂ ^(K+1)←γ₂ ^(K)+μ₂(w^(k+1)−c^(k+1)), γ₃ ^(K+1)←γ₃ ^(K)+μ₃(d^(T)u^(k+1)−b^(k+1)−q^(k+1)+1), γ₄ ^(K+1)←γ₄ ^(K)+μ₄(g^(T)v^(k+1)+b^(k+1)−p^(k+1)+1), γ₅ ^(K+1)←γ₅ ^(K)+μ₅(s^(k+1)−u^(k+1)), and γ₆ ^(K+1)←γ₆ ^(K)+μ₆(t^(k+1)−v^(k+1)).

According to a further aspect of the invention, the convex quadratic program solver minimizes a quadratic program ½β^(T)Hβ+f^(T)β with respect to β subject to

$\quad\left\{ \begin{matrix} {{a_{l} \leq {A\;\beta} \leq a_{u}},} \\ {{x_{l} \leq \beta \leq x_{u}},} \end{matrix} \right.$ wherein β=(w, b, ξ, u, v, η_(u), η_(v))εR^(M), M=m+n+k₁+k₂+3, f=(0, 0, ce^(T), 0, 0, ρ₂, ρ₄),

$H = \begin{pmatrix} {\left( {1 + \rho_{1} + \rho_{3}} \right)I_{m}} & 0 & {\rho_{1}B^{T}} & {{- \rho_{3}}D^{T}} & 0 \\ \; & 0 & \; & \; & \; \\ {\rho_{1}B} & 0 & {\rho_{1}{BB}^{T}} & \; & 0 \\ {{- \rho_{3}}D} & 0 & \; & {\rho_{3}{DD}^{T}} & 0 \\ \; & 0 & \; & \; & \; \end{pmatrix}$ ${A = \begin{pmatrix} \overset{\_}{X} & y & I_{n} & 0 & 0 & 0 & 0 \\ 0 & {- 1} & 0 & d^{T} & 0 & {- 1} & 0 \\ 0 & 1 & 0 & 0 & g^{T} & 0 & {- 1} \end{pmatrix}},{a_{l} = \begin{pmatrix} e \\ 0 \\ 0 \end{pmatrix}},{a_{u} = \begin{pmatrix} 0 \\ {- 1} \\ {- 1} \end{pmatrix}},$ and the values of x₁, x_(u)=±∞, respectively, for w, and x₁=0, x_(u)=∞ for the other components of β.

According to another aspect of the invention, there is provided a non-transitory program storage device readable by a computer, tangibly embodying a program of instructions executed by the computer to perform the method steps for training a classifier for selecting features in sparse data sets with high feature dimensionality that incorporates prior domain knowledge

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of the general form of an alternating directions method of multipliers (ADMM) algorithm, according to an embodiment of the invention.

FIG. 2 is a flowchart of an ADMM algorithm for solving an elastic net support vector machine (ENSVM), according to an embodiment of the invention.

FIG. 3 is a flowchart of a hybrid interior point and alternating direction algorithm for bottom-level training of a classifier, according to an embodiment of the invention.

FIGS. 4A-B is a flowchart of an ADMM algorithm for solving an elastic net support vector machine (ENSVM) that incorporates domain knowledge, according to an embodiment of the invention.

FIG. 5 is a flowchart of a two-phase algorithm for an elastic-net KSVM that incorporates domain knowledge, according to an embodiment of the invention.

FIG. 6 shows a 2D example that illustrates the negative effect of under-sampling on SVM, according to an embodiment of the invention.

FIG. 7 is a table that summarizes the attributes of the synthetic data sets, according to an embodiment of the invention.

FIG. 8 is a table that presents a comparison of prediction accuracies of HIPAD, ADMM, and LIBSVM on the synthetic data, according to an embodiment of the invention.

FIG. 9 is a table that summarizes the real data sets, according to an embodiment of the invention.

FIG. 10 is a table that summarizes the experiment results for the real data sets, according to an embodiment of the invention.

FIG. 11 is a table that summarizes the experimental results for a HIPAD-ENK according to an embodiment of the invention and a ADMM-ENKSVM according to an embodiment of the invention on synthetic data are presented.

FIG. 12 depicts plots of various attributes of an ADMM-ENK according to an embodiment of the invention against iterations.

FIG. 13 (top) is a plot of the solution w returned by a first phase of a HIPAD-ENK according to an embodiment of the invention on the data set ksvm-s-10k.

FIG. 13 (bottom) is a plot of the solution w returned by a HIPAD-ENK according to an embodiment of the invention on the data set ksvm-s-10k when d=g=0

FIG. 14 is a block diagram of an exemplary computer system for implementing a hybrid interior-point alternating directions algorithm for support vector machines and feature selection, according to an embodiment of the invention.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

Exemplary embodiments of the invention as described herein generally include systems and methods for hybrid interior-point alternating directions for support vector machines and feature selection. Accordingly, while the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit the invention to the particular forms disclosed, but on the contrary, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Mathematical Notation: All matrices are expressed in uppercase Latin letters. Lowercase Latin letters in bold denote vectors, and those in plain typeface denote scalars. Greek letters can be either vectors or scalars depending on the context. The sample-label pairs are denoted by (x_(i), y_(i)), i=1, . . . , N, where N is the number of samples, and x_(i)εR^(m). x_(i) is the i-th entry in a vector x.

1. A Two-Phase Hybrid Algorithm

According to an embodiment of the invention, for data sets with many features, the high dimensionality feature space still poses a computational challenge for the interior-point methods (IPM). Fortunately, many data sets of this kind, such as text documents, are sparse, and the resulting classifier w is also expected to be sparse, i.e. only a small subset of the features are expected to carry significant weights in classification. Obviously, IPM should train a classifier only on the most important features. According to an embodiment of the invention, a two-phase algorithm can appropriately shrink the feature space to leverage the high accuracy of IPM while maintaining efficiency. Specifically, a first phase of an according to an embodiment of the invention can solve an elastic-net support vector machine (ENSVM) or doubly-regularized SVM (DrSVM). The elastic-net regularization performs feature selection with a grouping effect and has been shown to be effective on data sets with many but sparse features and high local correlations. This is the case for text classification data, microarray gene expression, and fMRI data sets. The support of the weight vector w usually stabilizes well before the algorithm converges to the optimal solution. Taking advantage of that stabilization, the first phase of the hybrid algorithm can be terminated early and a second phase can solve a standard SVM with the reduced feature set using a solver that implements an IPM.

1.1. ENSVM and ADMM

A support vector machine (SVM) can be written in the regularized regression form as:

$\begin{matrix} {{\min\limits_{w,b}{\frac{1}{N}{\sum\limits_{i = 1}^{N}\;\left( {1 - \left( {y_{i}\left( {{x_{i}^{T}w} + b} \right)} \right)} \right)_{+}}}} + {\frac{\lambda}{2}{w}_{2}^{2}}} & (1) \end{matrix}$ where the first term is an averaged sum of the hinge losses and the second term can be viewed as a ridge regularization on w. One may see from this form that a standard SVM does not enforce sparsity in the solution, and w is expected to be completely dense. The ENSVM adds an L₁ regularization on top of the ridge regularization term, giving:

$\begin{matrix} {{\min\limits_{w,b}{\frac{1}{N}{\sum\limits_{i = 1}^{N}\;\left( {1 - \left( {y_{i}\left( {{x_{i}^{T}w} + b} \right)} \right)} \right)_{+}}}} + {\lambda_{1}{w}_{1}} + {\frac{\lambda_{2}}{2}{w}_{2}^{2}}} & (2) \end{matrix}$ The resulting regularization

${\lambda_{1}{w}_{1}} + {\frac{\lambda_{2}}{2}{w}_{2}^{2}}$ is the same as that used in an elastic-net regression, which can select highly correlated features in groups (i.e. the grouping effect) while still enforcing sparsity in the solution. This is a useful feature for classification of text documents, which is common in the hierarchical classification setting. Adopting elastic-net regularization as in EQ. (2) yields the same benefit to SVM for training classifiers.

To approximately solve EQ. (2), embodiments of the invention use hybrid methods that efficiently combine an alternating directions method of multipliers (ADMM) with an IPM. ADMM is an extreme case of the inexact augmented Lagrangian (IAL) method for the structured unconstrained system

$\begin{matrix} {{\min\limits_{x}{F(x)}} \equiv {{f(x)} + {g({Ax})}}} & (3) \end{matrix}$ where both functions f( ) and g( ) are convex. These two functions can be decoupled by introducing an auxiliary variable y to convert EQ. (3) into an equivalent constrained formulation:

$\begin{matrix} \begin{matrix} \min\limits_{x,y} & {{f(x)} + {g(y)}} \\ {s.t.} & {{Ax} = {y.}} \end{matrix} & (4) \end{matrix}$ This technique is often called variable-splitting. Here, x and y are used as generic variables. An IAL method minimizes in each iteration the augmented Lagrangian of EQ. (4)

$\begin{matrix} {{L\left( {x,y,\gamma} \right)}:={{f(x)} + {g(y)} + {\gamma^{T}\left( {y - {AX}} \right)} + {\frac{\mu}{2}{{{Ax} - y}}_{2}^{2}}}} & (5) \end{matrix}$ approximately, followed by an update to the Lagrange multiplier γ←γ−μ(Ax−y). The IAL method is guaranteed to converge to the optimal solution of EQ. (3), as long as the augmented Lagrangian can be approximately minimized with an increasing accuracy. ADMM can be viewed as a practical implementation of IAL, where the task of approximately minimizing L(x, y, γ) is performed with respect to x and y alternating once. The general form of an ADMM according to an embodiment of the invention is stated in Algorithm 1, with reference to the steps of the flowchart of FIG. 1:

Algorithm 1 ADMM   1 Choose γ⁰. (Step 11) 2 for k = 0, 1, . . . , K do 3  x^(k+1) ← arg min_(x) L(x, y^(k), γ^(k)) (Step 12) 4  y^(k+1) ← arg min_(y) L(x_(k+1), y, γ^(k)) (Step 13) 5  γ^(k+1) ← γ^(k) − μ(Ax^(k+1) − y^(k+1)) (Step 14) 6 end for (Step 15) 7 return y^(K) (Step 16) It can be shown that ADMM converges for the case of two-way splitting as in Algorithm 1: Theorem 1. Consider EQ. (3), where both f and g are proper, closed, convex functions, and AεR^(n×m) has full column rank. Then, starting with an arbitrary μ>0 and x⁰, y⁰εR^(m), the sequence {x^(k)} generated by Algorithm 1.1 converges to a solution x* of EQ. (3), if EQ. (3) has a solution. If EQ. (3) does not have a solution, at least one of the sequences {x^(k)} and {y^(k)} diverges.

A variable-splitting and ADMM according to an embodiment of the invention applied to EQ. (2) uses auxiliary variables (a, c) and linear constraints so that the non-smooth hinge loss and L₁-norm in the objectives function are de-coupled, allowing optimization over each single auxiliary variable. Specifically, EQ. (2) can be transformed into an equivalent form:

$\begin{matrix} {{{\min\limits_{w,b,a,c}{\frac{1}{N}{\sum\limits_{i = 1}^{N}\; a_{i}}}} + {\lambda_{1}{c}_{1}} + {\frac{\lambda_{2}}{2}{w}_{2}^{2}}}{{{s.t.\mspace{14mu} a} = {e - {Y\left( {{Xw} + {be}} \right)}}},\mspace{50mu}{c - {w.}}}} & (5) \end{matrix}$ The augmented Lagrangian,

$\begin{matrix} {{L\left( {w,b,a,c,\gamma_{1},\gamma_{2}} \right)}:={{\frac{1}{N}{\sum\limits_{i = 1}^{N}\; a_{i}}} + {\lambda_{1}{c}_{1}} + {\frac{\lambda_{2}}{2}{w}_{2}^{2}} + {\gamma_{1}^{T}\left( {e - {Y\left( {{Xw} + {be}} \right)} - a} \right)} + {\gamma_{2}^{T}\left( {w - c} \right)} + {\frac{\mu_{1}}{2}{{e - {Y\left( {{Xw} + {be}} \right)} - a}}_{2}^{2}} + {\frac{\mu_{2}}{2}{{w - c}}_{2}^{2}}}} & (6) \end{matrix}$ is then alternately minimized with respect to (w, b), a, and c in each iteration, followed by an update to the Lagrange multipliers γ₁ and γ₂. The original formulation is thus decomposed into three subsystems that include computing the proximal operator of the hinge loss function (with respect to a), solving a special linear system (with respect to (w, b)), and performing a soft-thresholding operation (with respect to c), which can all be performed efficiently. In particular, the linear system at Line 3 of Algorithm 1 can be solved by a small number of preconditioned conjugate gradient (PCG) iterations, however, other methods as are known in the art may also be used, such as a Cholesky Decomposition or an LU factorization.

Before presenting a two-phase algorithm according to an embodiment of the invention for bottom-level training in Algorithm 2, the following operators are defined:

${S_{\lambda}(\omega)} = \left\{ {{\begin{matrix} {{\omega - \lambda},} & {{\omega > \lambda};} \\ {0,} & {{0 \leq \omega \leq \lambda};} \\ {\omega,} & {\omega < 0.} \end{matrix}\left( {{Proximal}\mspace{14mu}{operator}\mspace{14mu}{associated}\mspace{14mu}{with}\mspace{14mu}{hinge}\mspace{14mu}{loss}} \right){T_{\lambda}(\omega)}} = {{{sgn}(\omega)}\max{\left\{ {0,{{\omega } - \lambda}} \right\}.\left( {{Shrinkage}\mspace{14mu}{operator}} \right)}}} \right.$ Both of these operators apply component-wise to the input vector w.

An exemplary, non-limiting alternating direction algorithm for training an elastic-net support vector machine is as follows, with reference to the steps in the flowchart of FIG. 2.

Algorithm 2: ADMM-ENSVM 1. Input: w⁰, b⁰, a⁰, c⁰, γ₁ ⁰, γ₂ ⁰ 2. for k = 0, 1, . . ., K − 1 do 3.  (w^(k+1), b^(k+1)) ← PCG solution of the structured linear system ${\begin{pmatrix} {{\left( {\lambda_{2} + \mu_{2}} \right)I} + {\mu_{1}X^{T}X}} & {\mu_{1}X^{T}e} \\ {\mu_{1}e^{T}X} & {\mu_{1}N} \end{pmatrix}\begin{pmatrix} w^{k + 1} \\ b^{k + 1} \end{pmatrix}} = \begin{pmatrix} {{X^{T}Y\;\gamma_{1}^{k}} - {\mu_{1}X^{T}{Y\left( {a^{k} - e} \right)}} - \gamma_{2}^{k} + {\mu_{2}c^{k}}} \\ {{e^{T}Y\;\gamma_{1}^{k}} - {\mu_{1}e^{T}{Y\left( {a_{k} - e} \right)}}} \end{pmatrix}$ (step 20 of FIG. 2) 4.   $\left. a^{k + 1}\leftarrow{S_{1\text{/}N\;\mu_{1}}\left( {e + \frac{\gamma_{1}^{k}}{\mu_{1}} - {Y\left( {{Xw}^{k + 1} + {b^{k + 1}e}} \right)}} \right)} \right.$ (step 21 of FIG. 2) 5.   $\left. c^{k + 1}\leftarrow{T_{\lambda_{1}\text{/}\mu_{2}}\left( {\frac{\gamma_{2}^{k}}{\mu_{2}} + w^{k + 1}} \right)} \right.$ (step 22 of FIG. 2) 6.  γ₁ ^(k+1) ← γ₁ ^(k) + μ₁(e − Y(Xw^(k+1) + b^(k+1)e) − a^(k+1)) (step 23 of FIG. 2) 7.  γ₂ ^(k+1) ← γ₂ ^(k) + μ₂(w^(k+1) − c^(k+1)) (step 24 of FIG. 2) 8. end for (step 25 of FIG. 2) 9. return (w^(K−1), b^(K−1)) (step 26 of FIG. 2)

In line 3, above, the N appearing in the lower right of the matrix on the left-hand side of the linear system is a matrix with N's in the main diagonal, where N is the number of samples. Note that Theorem 1 does not apply to a general k-wise update of the variables, where k>2. The above application of ADMM to ENSVM according to an embodiment of the invention may first appear to employ a three-split of the variables, but the updates of a and c are actually independent from each other, given (w, b). Hence, these two updates can be viewed as a single update, performed simultaneously, leading to a two-way update of the variables (w, b) and (a, c), and Theorem 1 still applies in this case.

1.2 A Two-Phase Algorithm

A first phase of an algorithm according to an embodiment of the invention can appropriately reduce the feature space dimensionality without impacting the final prediction accuracy. As discussed above, the suitability of ADMM for the first phase depends on the support of the feature vector quickly converges. Here, this is empirically demonstrated.

An exemplary data set has 50 samples with 300 features each. An ADMM according to an embodiment of the invention converged in 558 iterations. The output classifier w contains only 13 non-zero features, and the feature support converged in less than 50 iteration. Hence, one should monitor the change in the signs and indices of the support and terminate the first phase promptly. An embodiment of the invention uses the following criterion to monitor the relative change in the iterates as a surrogate of the change in the support:

$\begin{matrix} {\frac{{w^{k + 1} - w^{k}}}{\max\left( {{w^{k}},1} \right)} < ɛ_{tol}} & (7) \end{matrix}$ Thus, in the algorithm of FIG. 2, the loop over k would break when the condition of EQ. (7) is satisfied. Experiments of embodiments of the invention showed that when the change over the iterates is small, the evolution of the support indices also stabilizes.

Upon starting a second phase of an algorithm according to an embodiment of the invention, the IPM should warm-start from the corresponding sub-vector of the ADMM solution. IPM is also applied during the second phase to solve the standard SVM using an L₂-regularization instead of the ENSVM used in the first phase. Although ENSVM can be reformulated as a quadratic program (QP), the size of the system is larger than an SVM due to the additional linear constraints introduced by the L₁-norm. In addition, since the feature support has been at least approximately identified in the first phase of the algorithm, enforcing sparsity in the reduced feature space becomes less critical.

An exemplary, non-limiting hybrid interior point and alternating direction algorithm for bottom-level training of a classifier is as follows, with reference to the steps in the flowchart of FIG. 3.

Algorithm 3: A Hybrid Interior Point and Alternating Direction method 1. Input: w⁰, b⁰, a⁰, c⁰, γ₁ ⁰ , γ₂ ⁰ (Step 31) 2. // PHASE 1: ADMM for ENSVM 3. (w^(ADMM), b^(ADMM)) ← ADMM-ENSVM(w⁰, b⁰, a⁰, c⁰, γ₁ ⁰, γ₂ ⁰) (Step 32) 4. // PHASE 2: IPM for SVM 5. {tilde over (w)} ← non-zero components of w^(ADMM) (Step 33) 6. ({tilde over (X)}, {tilde over (Y)})← sub-matrices of (X, Y) corresponding to the (Step 34) support of w^(K) 7. (w, b) ← SVM-IPM({tilde over (X)}, {tilde over (Y)}, {tilde over (w)}, b) (Step 35) 8. return (w, b) (Step 36)

Note that in embodiments of the invention, the ADMM-ENSVM of Phase 1 will use the stopping criteria of EQ. (7) rather than looping over all k. According to embodiments of the invention, the system being solved in Phase 2 of Algorithm 3 is much smaller than the system being solved in Phase 1. An IPM according to an embodiment of the invention uses second order derivative information, so that it can converge much faster than a method using first order derivatives. However, an IPM needs to store a Hessian matrix of second derivatives, which requires n² storage elements for an output vector of n elements. If the full vector output by Phase 1 is passed to an IPM solver according to an embodiment of the invention, the storage required for the Hessian matrix by the IPM solver would be prohibitive. Also, the solution of the Newton system of equations would be time consuming (order n² in the number of features). However, the number of non-zero elements of the w^(ADMM) output by Phase 1 would typically be several orders of magnitude less than the full vector, and thus an IPM solver according to an embodiment of the invention would be able to handle the storage requirements for the non-zero elements.

2. SVM Via Interior-Point Method

An exemplary, non-limiting method according to an embodiment of the invention for solving an SVM using IPM makes use of an open source, publicly available QP solver known as OOQP, a software package for convex quadratic programming that is disclosed in E. Gertz, S. Wright, “Object-oriented software for quadratic programming”, ACM Transactions on Mathematical Software (TOMS), 29(1):58-81, 2003, the contents of which are herein incorporated by reference in their entirety. The OOQP Package is an object-oriented implementation of a primal-dual interior-point algorithm in C++ for the general convex QP system:

$\begin{matrix} \begin{matrix} \min\limits_{\beta} & {{{\frac{1}{2}\beta^{T}H\;\beta} + {f^{T}\beta}},} \\ \; & {{a_{l} \leq {A\;\beta} \leq a_{u}},} \\ \; & {{{A_{eq}\beta} = a_{eq}},} \\ \; & {x_{l} \leq \beta \leq {x_{u}.}} \end{matrix} & (8) \end{matrix}$ In a HIPAD method according to an embodiment of the invention, the SVM is formulated as a linearly equality-constrained convex QP. Hence, the first set of constraints in EQ. (8) can be ignored.

The primal formulation of an SVM (PSVM) according to an embodiment of the invention is

$\begin{matrix} \begin{matrix} \min\limits_{w,b,s,\xi} & {{\frac{1}{2}w^{T}w} + {{ce}^{T}\xi}} \\ {s.t.} & {{{{y_{i}\left( {{w^{T}x_{i}} - b} \right)} - s_{i} + \xi_{i}} = 1},\mspace{14mu}{i = 1},\ldots\mspace{14mu},N,} \\ \; & {{s \geq 0},{\xi \geq 0},} \end{matrix} & (9) \end{matrix}$ where c is a user-defined penalty parameter for ξ. The Hessian H from EQ. (8) is simply an augmented identity padded with zeros at the diagonal entries corresponding to (b, ξ, s).

The dual (DSVM) of the PSVM of EQ. (9) is

$\begin{matrix} \begin{matrix} \min\limits_{\alpha} & {{\frac{1}{2}\alpha^{T}Q\;\alpha} - {e^{T}\alpha}} \\ {s.t.} & {{{y^{T}\alpha} = 0},} \\ \; & {{0 \leq \alpha_{i} \leq c},\mspace{14mu}{i = 1},\ldots\mspace{14mu},N,} \end{matrix} & (10) \end{matrix}$ where Q_(ij)=y_(i)y_(j)x_(i) ^(T)x_(j)= XX ^(T). In the notation of EQ. (8), H=Q, A_(eq)=y^(T), and α_(eq)=0. The optimal solution, by considering the KKT conditions of EQS. (9) and (10) is given by

${w = {{{\overset{\_}{X}}^{T}\alpha} = {\sum\limits_{i \in {SV}}\;{\alpha_{i}y_{i}x_{i}}}}},$ where SV is the set of sample indices corresponding to the support vectors. The optimal bias term b can be computed from the complementary slackness condition α_(i)(y _(i)(x _(i) ^(T) w+b)−1+ξ_(i))

Whether to solve EQ. (9) (SVM-P) or EQ. (10) (SVM-D) for a given data set depends on its dimensions as well as sparsity. If X is a sparse matrix, Q in (SVM-D) is still likely to be dense, whereas the Hessian matrix in (SVM-P) is the identity. The primal formulation (SVM-P), however, has a larger variable dimension and more constraints. In the context of the classification tasks considered herein, since IPM is used in the second phase, the number of non-zero features (m) is relatively small, and the number of observations (N) may be assumed small too. When m<N or X is sparse, embodiments of the invention may solve (SVM-P). For dense systems with m≧N, embodiments of the invention may solve (SVM-D), since computing Q= XX ^(T) is inexpensive and Q is only of N×N. In experiments of embodiments of the invention, (SVM-D) was solved for all the data sets for simplicity, however, those of skill in the art will readily recognize that (SVM-P) could also be solved in other embodiments.

3. Domain Knowledge Incorporation

Often, one has prior domain knowledge for specific classification tasks. Domain knowledge is most helpful when the training data does not form a comprehensive representation of the underlying unknown population, resulting in poor generalization performance of SVM on the unseen data from the same population. This often arises in situations where labeled training samples are scarce, while there is an abundance of unlabeled data. A simple two-dimensional example is used to illustrate this idea in FIG. 6. The solid line 61 in FIG. 6 is the SVM solution based on the training data, while the dotted line 62 is the solution of SVM trained on the entire population.

The optimal separating hyperplane obtained from SVM for the training points 63, 64 is clearly the line with equal distance to the two axes, the solid line 61 in FIG. 6. However, both test points 65, 66 lie on the ‘wrong’ side of the hyperplane because the true hyperplane for the entire population is quite different from the SVM solution as illustrated by the dotted line 62 in FIG. 6.

For high dimensional data, ENSVM performs feature selection during training to produce a simpler model and to achieve better prediction accuracy. However, the quality of the feature selection depends entirely on the training data. In cases like the one described above, the feature support identified by ENSVM may not form a good representation of the population. Hence, when domain knowledge about certain features is available, it should be taken into consideration during the training phase and the relevant features in the feature support should then be deemed important for classification.

Embodiments of the invention can incorporate domain knowledge in the form of class-membership information associated with features. In the context of the previous example, such knowledge could be implication relationships as follows: Every data sample x with feature x₂ greater than 2 should be classified as negative, and every sample x which lies on the negative side of the hyperplane

${f = \begin{pmatrix} 1.0 \\ {- 1.55} \end{pmatrix}},{f_{0} = {- 0.9}},{{{{i.e.\mspace{14mu} f^{T}}x} + f_{0}} \leq 0},$ should be classified to the positive class. Such information can be incorporated (or such rules be enforced) in a SVM by adding equivalent linear constraints to the SVM QP formulation (KSVM). To be specific, the above information can be modeled with the linear implication Bx≦d

w ^(T) x+b≧1,  (11) where BεR^(k) ¹ ^(×m) and dεR^(k) ¹ where k₁ is the number of positive class constraints. It can be shown that by utilizing the nonhomogeneous Farkas theorem of the alternative, EQ. (11) can be transformed into the following equivalent system of linear inequalities with a solution u: B ^(T) u+w=0,d ^(T) u−b+1≦0,u≧0  (12) For the sake of completeness, embodiments of the invention can consider the following linear implication for the negative class membership: Dx≦g

wTx+b≦−1,DεR ^(k) ² ^(×m) ,gεR ^(k) ² ,  (13) where k₂ is the number of negative class constraints, which can be similarly represented by the set of linear equations in v: D ^(T) v−w=0,g ^(T) v+b+1≦0,v≧0.  (14) Hence, to incorporate the domain knowledge represented by EQS. (11) and (13) into an SVM according to an embodiment of the invention, one can add the linear constraints of EQS. (12) and (14) to EQ. (9) (SVM-P):

$\begin{matrix} \begin{matrix} \min\limits_{w,b,s,\xi} & {{\frac{1}{2}w^{T}w} + {c\; e^{T}\xi} + {\frac{\rho_{1}}{2}{{{B^{T}u} + w}}_{1}} + {\frac{\rho_{2}}{2}{{{D^{T}v} - w}}_{1}}} \\ {s.t.} & {{{{y_{i}\left( {{w^{T}x_{i}} - b} \right)} - s_{i} + \xi_{i}} = 1},\mspace{14mu}{i = 1},\ldots\mspace{14mu},N,} \\ \; & {{{{d^{T}u} - b + 1} \leq 0},} \\ \; & {{{{g^{T}v} + b + 1} \leq 0},} \\ \; & {\left( {s,\xi,u,v} \right) \geq 0.} \end{matrix} & (15) \end{matrix}$ If the data from the two classes is not linearly separable with the augmented constraint set, one can modify EQ. (9) (SVM-P) as follows to form a soft-margin formulation:

$\begin{matrix} \begin{matrix} \min\limits_{w,b,\xi,u,v,{\eta\; u},{\eta\; v}} & \begin{matrix} {{\frac{1}{2}w^{T}w} + {{ce}^{T}\xi} + {\frac{\rho_{1}}{2}{{{B^{T}u} + w}}_{1}} +} \\ {{\rho_{2}\eta_{u}} + {\frac{\rho_{3}}{2}{{{D^{T}v} - w}}_{1}} + {\rho_{4}\eta_{v}}} \end{matrix} \\ {s.t.} & {{{{y_{i}\left( {{w^{T}x_{i}} + b} \right)} \geq {1 - \xi_{i}}} = 1},\mspace{14mu}{i = 1},\ldots\mspace{14mu},N,} \\ \; & {{{{d^{T}u} - b + 1} \leq \eta_{u}},} \\ \; & {{{{g^{T}v} + b + 1} \leq \eta_{v}},} \\ \; & {\left( {\xi,u,v,\eta_{u},\eta_{v}} \right) \geq 0.} \end{matrix} & (16) \end{matrix}$ The two L₁-norm terms in the objective function come from the relaxation of the equality constraints, (e.g. B^(T)u+w=0) to the corresponding box constraints (e.g. −z_(u)≦B^(T)u+w≦z_(u)). Compared to EQ. (9) (SVM-P), this formulation, with the L₁ terms expressed as box constraints, increases both the variable dimension and the number of linear constraints by at least 2m. This is clearly undesirable when m is large, which is the scenario considered herein.

Note that if the quadratics ∥B^(T)u+w∥₂ ² and ∥D^(T)v−w∥₂ ² penalized instead of their L₁ counterparts, the above optimization formulation is a smaller convex QP. This is due to the non-differentiability of the L₁ norm at zero, since it is a sum of absolute values, which effectively doubles the number of constraints. Hence, embodiments of the invention consider the following primal formulation for domain knowledge incorporation (KSVM-P):

$\begin{matrix} \begin{matrix} \min\limits_{w,b,\xi,u,v,{\eta\; u},{\eta\; v}} & \begin{matrix} {{\frac{1}{2}w^{T}w} + {{ce}^{T}\xi} + {\frac{\rho_{1}}{2}{{{B^{T}u} + w}}_{2}^{2}} +} \\ {{\rho_{2}\eta_{u}} + {\frac{\rho_{3}}{2}{{{D^{T}v} - w}}_{2}^{2}} + {\rho_{4}\eta_{v}}} \end{matrix} \\ {s.t.} & {{{{y_{i}\left( {{w^{T}x_{i}} + b} \right)} \geq {1 - \xi_{i}}} = 1},\mspace{14mu}{i = 1},\ldots\mspace{14mu},N,} \\ \; & {{{{d^{T}u} - b + 1} \leq \eta_{u}},} \\ \; & {{{{g^{T}v} + b + 1} \leq \eta_{v}},} \\ \; & {\left( {\xi,u,v,\eta_{u},\eta_{v}} \right) \geq 0.} \end{matrix} & (17) \end{matrix}$ Embodiments of the invention can combine ENSVM and KSVM, and embodiments of the invention can solve the combined system in a HIPAD framework. This combination can exploit domain knowledge to improve the feature selection, and hence, the generalization performance of HIPAD. 3.1 ADMM Phase

Embodiments of the invention can use ADMM to solve an elastic-net SVM that incorporates domain knowledge. First, EQS. (2) and (17) can be combined, and the resulting optimization formulation can be written in an equivalent unconstrained form by penalizing the violation of the inequality constraints through hinge losses in the objective function (ENK-SVM):

$\begin{matrix} {\mspace{79mu}{{\min\limits_{w,b,{u \geq 0},{v \geq 0}}{F\left( {w,b,u,v} \right)}}\mspace{20mu}{where}{{F\left( {w,b,u,v} \right)} \equiv {{\frac{\lambda_{2}}{2}{w}_{2}^{2}} + {\lambda_{1}{w}_{1}} + {\frac{1}{N}{\sum\limits_{i = 1}^{N}\;{{\left( {1 - {y_{i}\left( {{x_{i}^{T}w} + b} \right)}} \right)++}\frac{\rho_{1}}{2}{{{B^{T}u} + w}}_{2}^{2}}}} + {{{\rho_{2}\left( {{d^{T}u} - b + 1} \right)}++}\frac{\rho_{3}}{2}{{{D^{T}v} - w}}_{2}^{2}} + {\rho_{4}\left( {{g^{T}v} + b + 1} \right)} + .}}}} & (18) \end{matrix}$ Variable-splitting can then be applied to decouple the L₁-norms and hinge losses and obtain the following objective function

${F\left( {w,b,u,v,a,c,p,q} \right)} \equiv {{\frac{\lambda_{2}}{2}{w}_{2}^{2}} + {\lambda_{1}{c}_{1}} + {\frac{1}{N}{{e^{T}(a)}++}\frac{\rho_{1}}{2}{{{B^{T}u} + w}}_{2}^{2}} + {{{\rho_{2}(q)}++}\frac{\rho_{3}}{2}{{{D^{T}v} - w}}_{2}^{2}} + {\rho_{4}(p)} +}$ and equivalent constrained optimization formulation:

$\begin{matrix} \begin{matrix} \min\limits_{w,b,u,v,a,c,p,q} & {F\left( {w,b,u,v,a,c,p,q} \right)} \\ {s.t.} & {{c = w},} \\ \; & {{a = {e - \left( {{\overset{\_}{X}w} + {yb}} \right)}},} \\ \; & {{q = {{d^{T}u} - b + 1}},} \\ \; & {{p = {{g^{T}v} + b + 1}},} \\ \; & {{u \geq 0},{v \geq 0.}} \end{matrix} & (19) \end{matrix}$ Embodiments of the invention then form the augmented Lagrangian L of the system of EQ. (19):

$\begin{matrix} {\mathcal{L} = {{F\left( {w,b,u,v,a,c,p,q} \right)} + {\gamma_{1}^{T}\left( {e - \left( {{\overset{\_}{X}w} + {yb}} \right) - a} \right)} + {\frac{\mu_{1}}{2}{{e - \left( {\overset{\_}{X} + {yb}} \right) - a}}_{2}^{2}} + {\gamma_{2}^{T}\left( {w - c} \right)} + {\frac{\mu_{2}}{2}{{w - c}}_{2}^{2}} + {\gamma_{3}\left( {{d^{T}u} - b + 1 - q} \right)} + {\frac{\mu_{3}}{2}{{{d^{T}u} - b + 1 - q}}_{2}^{2}} + {\gamma_{4}\left( {{g^{T}v} + b + 1 - p} \right)} + {\frac{\mu_{4}}{2}{{{g^{T}v} + b + 1 - p}}_{2}^{2}}}} & (20) \end{matrix}$ and minimize L with respect to w, b, c, a, p, q, u, v individually and in order. For the sake of readability, the non-negative constraints are not penalized for u and v in the augmented Lagrangian for the moment.

Given (a^(k), c^(k)p^(k), q^(k)), solving for (w, b) again involves solving a linear system

$\begin{matrix} {{\begin{pmatrix} {{\kappa_{1}I} + {\mu_{1}X^{T}X}} & {\mu_{1}X^{T}e} \\ {\mu_{1}e^{T}X} & {{\mu_{1}N} + \kappa_{2}} \end{pmatrix}\begin{pmatrix} w^{k + 1} \\ b^{k + 1} \end{pmatrix}} = \begin{pmatrix} r_{w} \\ r_{b} \end{pmatrix}} & (21) \end{matrix}$ where κ₁=λ₂+μ₂+ρ₁+ρ₃, κ₂=μ₃+μ₄, r _(w) =X ^(T) Yγ ₁ ^(k)+μ₁ X ^(T) Y(e−a ^(k))−γ₂ ^(k)+μ₂ c ^(k)+ρ₃ D ^(T) v ^(k)−ρ₁ B ^(T) u ^(k), and r _(b) =e ^(T) Yγ ₁ ^(k)+μ₁ e ^(T) Y(e−a ^(k))+γ₃ ^(k)+μ₃(d ^(T) u ^(k)+1−q ^(k))−γ₄ ^(k)−μ₄(g ^(T) v ^(k)+1−p ^(k)). As with the linear system of Algorithm 2, above, the N in the lower right of the matrix on the left-hand side is a matrix with diagonal elements equal to N. Similar to solving the linear system in Algorithm 2, the solution to the above linear system can be computed through a few PCG iterations, taking advantage of the fact that the left-hand-side matrix is of low-rank. Note however, that other methods as are know to those of skill in the art can be used to solve EQ. (21), such as Cholesky decomposition or LU factorization.

To minimize the augmented Lagrangian with respect to u, embodiments of the invention can solve a convex quadratic formulation with non-negative constraints:

$\begin{matrix} {{\min\limits_{u \geq 0}{\frac{\rho_{1}}{2}{{{B^{T}u} + w^{k + 1}}}_{2}^{2}}} + {\gamma_{3}^{k}d^{T}u} + {\frac{\mu_{3}}{2}{{{{d^{T}u} - b^{k + 1} + 1 - q^{k}}}_{2}^{2}.}}} & (22) \end{matrix}$ Obtaining a closed-form solution to the above formulation through a direct approach is challenging. However, by introducing a slack variable s and transferring the non-negative constraint on u to s, the formulation of EQ. (22) can be decomposed into two parts:

$\begin{matrix} {{{\min\limits_{u,{s \geq 0}}{\frac{\rho_{1}}{2}{{{B^{T}u} + w^{k + 1}}}_{2}^{2}}} + {\gamma_{3}^{k}d^{T}u} + {\frac{\mu_{3}}{2}{{{d^{T}u} - b^{k + 1} + 1 - q^{k}}}_{2}^{2}}}{{{s.t.\mspace{14mu} u} - s} = 0.}} & (23) \end{matrix}$ Penalizing the linear constraint u−s=0 in the new augmented Lagrangian, the new subsystem with respect to (u, s) is

$\begin{matrix} {{\min\limits_{u,{s \geq 0}}{\frac{\rho_{1}}{2}{{{B^{T}u} + w^{k + 1}}}_{2}^{2}}} + {\gamma_{3}^{k}d^{T}u} + {\frac{\mu_{3}}{2}{{{d^{T}u} - b^{k + 1} + 1 - q^{k}}}_{2}^{2}} + {\gamma_{5}^{T}\left( {s - u} \right)} + {\frac{\mu_{5}}{2}{{{u - s}}_{2}^{2}.}}} & (24) \end{matrix}$ Given an s^(k)≧0, one can compute u^(k+1) by solving a k₁×k₁ linear system: (ρ₁ BB ^(T)+μ₃ dd ^(T)+μ₅ I)u ^(k+1/2) =r _(u),  (25) where r _(u)=−ρ₁ Bw ^(k+1)+μ₃ db ^(k+1)+μ₃ d(q ^(k)−1)−dγ ₃ ^(k)+γ₅+μ₅ s ^(k) Embodiments of the invention can assume that B is full row-rank. This is a reasonable assumption since otherwise there is at least one redundant domain knowledge constraint which can be removed. The number of domain knowledge constraints k₁ and k₂ are usually small, so EQ. (25) can be solved exactly and efficiently. An exemplary, non-limiting method for solving EQ. (25) uses a Cholesky factorization.

Embodiments of the invention can solve for the s^(k+1) corresponding to u^(k+1) by observing that EQ. (24) is separable in the elements of s. For each element s_(i), an optimal solution to the one-dimensional quadratic formulation with a non-negative constraint on s_(i) is given by

${\max\left( {0,{u_{i} - \frac{\left( \gamma_{5} \right)_{i}}{\mu_{5}}}} \right)},$ which can be written in vector form:

$s^{k + 1} = {{\max\left( {0,{u^{k + 1} - \frac{\gamma_{5}^{k}}{\mu_{5}}}} \right)}.}$

Similarly, embodiments of the invention can solve for v^(k+1) by introducing a non-negative slack variable t and computing (ρ₃ DD ^(T)+μ₄ gg ^(T)+ρ₆ I)v ^(k+1) =r _(v),  (26) where r _(v)=ρ₃ Dw ^(k+1)−μ₄ gb ^(k+1) −gγ ₄ ^(k)−μ₄ g(1−p ^(k))+γ₆+μ₆ t ^(k).

Embodiments of the invention can solve for the t^(k+1) corresponding to v^(k+1) similar to the solution for s:

$t^{k + 1} = {{\max\left( {0,{v^{k + 1} - \frac{\gamma_{6}^{k}}{\mu_{6}}}} \right)}.}$

Now, given (w^(k+1), b^(k+1), u^(k+1), v^(k+1)), the solutions for a and c are the same as in Lines 4 and 5 of Algorithm 2:

$\begin{matrix} {{a^{k + 1} = {{??}_{\frac{1}{N_{\mu_{1}}}}\left( {e + \frac{\gamma_{1}^{k}}{\mu_{1}} - {Y\left( {{X\; w^{k + 1}} + {b^{k + 1}e}} \right)}} \right)}},{c^{k + 1} = {{{??}_{\frac{\lambda_{1}}{\mu_{2}}}\left( {\frac{\gamma_{2}^{k}}{\mu_{2}} + w^{k + 1}} \right)}.}}} & (27) \end{matrix}$

The subsystem with respect to q is

$\begin{matrix} {{{\min\limits_{q}{\rho_{2}(q)}_{+}} - {\gamma_{3}^{k}q} + {\frac{\mu_{3}}{2}{{{d^{T}u^{k}} - b^{k + 1} + 1 - q}}_{2}^{2}}} \equiv {{\rho_{2}(q)}_{+} + {\frac{\mu_{3}}{2}{{{q - \left( {{d^{T}u^{k}} - b^{k + 1} + 1 + \frac{\gamma_{3}^{k}}{\mu_{3}}} \right)}}_{2}^{2}.}}}} & (28) \end{matrix}$ The solution is given by a (one-dimensional) proximal operator associated with the hinge loss:

$\begin{matrix} {q^{k + 1} = {{{??}_{\frac{\rho_{2}}{\mu_{3}}}\left( {{d^{T}u^{k}} - b^{k + 1} + 1 + \frac{\gamma_{3}^{k}}{\mu_{3}}} \right)}.}} & (29) \end{matrix}$

Similarly, the subsystem with respect to p is:

$\begin{matrix} {{{\min\limits_{p}{\rho_{4}(p)}_{+}} - {\gamma_{4}^{k}p} + {\frac{\mu_{4}}{2}{{{g^{T}v^{k}} + b^{k + 1} + 1 - p}}_{2}^{2}}},} & (30) \end{matrix}$ and the solution is given by

$\begin{matrix} {p^{k + 1} = {{{??}_{\frac{\rho_{4}}{\mu_{4}}}\left( {{g^{T}v^{k}} + b^{k + 1} + 1 + \frac{\gamma_{4}^{k}}{\mu_{4}}} \right)}.}} & (31) \end{matrix}$ The above steps are summarized in Algorithm 4, with reference to the steps of the flowchart in FIGS. 4A-B.

Algorithm 4: ADMM-ENK 1. Input: w⁰, b⁰, a⁰, c⁰, u⁰, v⁰, p⁰, q⁰, s⁰ ≧ 0, t⁰ ≧ 0, and (Step 40) the parameters γ_(i) ⁰, λ₁, λ₂, ρ_(i), i = 1, . . . , 6. 2. for k = 0, 1, . . . , K − 1 do 3.  (w^(k+1), b^(k+1)) ← PCG solution of EQ. (21) (Step 41) 4.  u^(k+1) ← solution of EQ. (25) (Step 42) 5.   $\left. s^{k + 1}\leftarrow{\max\left( {0,{u^{k + 1} - \frac{\gamma_{5}^{k}}{\mu_{5}}}} \right)} \right.$ (Step 42) 6.  v^(k+1) ← solution of EQ. (26) (Step 43) 7.   $\left. t^{k + 1}\leftarrow{\max\left( {0,{v^{k + 1} - \frac{\gamma_{6}^{k}}{\mu_{6}}}} \right)} \right.$ (Step 43) 8.   $\left. a^{k + 1}\leftarrow{S_{1\text{/}N\;\mu_{1}}\left( {e + \frac{\gamma_{1}^{k}}{\mu_{1}} - {Y\left( {{Xw}^{k + 1} + {b^{k + 1}e}} \right)}} \right)} \right.$ (Step 44) 9.   $\left. c^{k + 1}\leftarrow{T_{\lambda_{1}\text{/}\mu_{2}}\left( {\frac{\lambda_{2}^{k}}{\mu_{2}} + w^{k + 1}} \right)} \right.$ (Step 45) 10.   $\left. q^{k + 1}\leftarrow{S_{\rho_{2}\text{/}\mu_{3}}\left( {{d^{T}u^{k}} - b^{k + 1} + 1 + \frac{\gamma_{3}^{k}}{\mu_{3}}} \right)} \right.$ (Step 45) 11.   $\left. p^{k + 1}\leftarrow{S_{\rho_{4}\text{/}\mu_{4}}\left( {{g^{T}v^{k}} + b^{k + 1} + 1 + \frac{\gamma_{4}^{k}}{\mu_{4}}} \right)} \right.$ (Step 45) 12.  γ₁ ^(K+1) ← γ₁ ^(K) + μ₁(e − Y(Xw^(k+1) + b^(k+1)e) − a^(k+1)) (Step 46) 13.  γ₂ ^(K+1) ← γ₂ ^(K) + μ₂(w^(k+1) − c^(k+1)) (Step 46) 14.  γ₃ ^(K+1) ← γ₃ ^(K) + μ₃(d^(T)u^(k+1) − b^(k+1) − q^(k+1) + 1) (Step 46) 15.  γ₄ ^(K+1) ← γ₄ ^(K) + μ₄(g^(T)v^(k+1) + b^(k+1) − p^(k+1) + 1) (Step 46) 16.  γ₅ ^(K+1) ← γ₅ ^(K) + μ₅(s^(k+1) − u^(k+1)) (Step 46) 17.  γ₆ ^(K+1) ← γ₆ ^(K) + μ₆(t^(k+1) − v^(k+1)) (Step 46) 18. end for (Step 47) 19. return (w^(K−1), b^(K−1)) (Step 48)

Although there appears to be ten additional parameters (two λ's, four ρ's and four μ's) in the ADMM method for ENK-SVM, embodiments of the invention can set the ρ's to the same value and do the same for the μ's. Hence, in practice, there is only one additional parameter to tune, and experiments described below indicate that an algorithm according to an embodiment of the invention is insensitive to the μ's. In addition, note that due to the presence of additional constraints, the stopping criteria of EQ. (7) should not be used to break out of the loop over k.

3.2 IPM Phase

The Hessian of EQ. (17) (KSVM-P) with respect to the variables β=(w, b, ξ, u, v, η_(u), η_(v))εR^(M), M=m+N+k₁+k₂+3, is a highly sparse matrix:

$\begin{matrix} {H = {\begin{pmatrix} {\left( {1 + \rho_{1} + \rho_{3}} \right)I_{m}} & 0 & {\rho_{1}B^{T}} & {{- \rho_{3}}D^{T}} & 0 \\ \; & 0 & \; & \; & \; \\ {\rho_{1}B} & 0 & {\rho_{1}B\; B^{T}} & \; & 0 \\ {{- \rho_{3}}D} & 0 & \; & {\rho_{3}D\; D^{T}} & 0 \\ \; & 0 & \; & \; & \; \end{pmatrix}.}} & (32) \end{matrix}$ An exemplary, non-limiting solver for solving EQ. (17) is the IPM QP solver available from www.mosek.com. The Mosek solver contains a highly efficient primal-dual IPM implementation that works very well for large sparse systems. To use the solver's interface from a HIPAD algorithm according to an embodiment of the invention, EQ. (17) is cast into the general form:

$\begin{matrix} {{{\min\limits_{\beta}{\frac{1}{2}\beta^{T}H\;\beta}} + {f^{T}\beta}}{{a_{l} \leq {A\;\beta} \leq a_{u}},{x_{l} \leq \beta \leq x_{u}},}} & (33) \end{matrix}$ with H defined above, f=(0, 0, ce^(T), 0, 0, ρ₂, ρ₄), and

$\begin{matrix} {{A = \begin{pmatrix} \overset{\_}{X} & y & I_{n} & 0 & 0 & 0 & 0 \\ 0 & {- 1} & 0 & d^{T} & 0 & {- 1} & 0 \\ 0 & 1 & 0 & 0 & g^{T} & 0 & {- 1} \end{pmatrix}},{a_{l} = \begin{pmatrix} e \\ 0 \\ 0 \end{pmatrix}},{a_{u} = {\begin{pmatrix} 0 \\ {- 1} \\ {- 1} \end{pmatrix}.}}} & (34) \end{matrix}$ The box-constraint limit values x_(l) and x_(u) depend upon the component of β. For the first component, w, the values of x_(l) and x_(u) are ±∞, since the w are unconstrained. For the other components, x_(l)=0 and x_(u)=∞.

A two-phase algorithm for an elastic-net KSVM that incorporates domain knowledge according to an embodiment of the invention is presented in Algorithm 5, with reference to the steps of the flowchart in FIG. 5.

Algorithm 5: HIPAD-ENK 1 Input: w⁰, b⁰, a⁰, c⁰, u⁰, v⁰, p⁰, q⁰, s⁰≧0, t⁰≧0, (Step 51) 2 // PHASE 1: ADMM for ENK-SVM 3 (w, b, u, v) ← (Step 52)  ADMM-ENK (w⁰, b⁰, a⁰, c⁰, u⁰, v⁰, p⁰, q⁰, s⁰ , t⁰) 4 // PHASE 2: IPM for KSVM 5 {tilde over (w)} ← non-zero components of w (Step 53) 6 ({tilde over (X)}, {tilde over (Y)})←sub-matrices of (X, Y) corresponding to the (Step 54) support of w 7 η_(u) ⁰ ← d^(T) u − b + 1 (Step 55) 8 η_(v) ⁰ ← g^(T) v + b + 1 (Step 55) 9 (w, b) ← (Step 56)  SVM-IPM ({tilde over (X)}, {tilde over (Y)}, {tilde over (w)}, b, u, v, η_(u) ⁰, η_(v) ⁰ ) using EQS. (32)and (34) 10 return (w, b) (Step 57) 3.3 Convergence

Recall from Remark 1 that the convergence of ADMM, as stated in Theorem 1, only applies to the case of two-way update of the variables. In a first phase of Algorithm 5, while the updates to a, c, q, p, s and t are independent from each other given (w, b, u, v), the updates to w, b, u, and v are not. However, one may observe from EQS. (21), (25), and (26) that the new iterate of (w, b, u, v) can be computed by solving an augmented linear system in (w, b, u, v) similar to Line 3 in Algorithm 4. With this modification, Algorithm 4 essentially employs a two-way update of the variables (w, b, u, v) and (a, c, p, q). Hence, convergence is guaranteed by Theorem 1. In practice, a sequential updating scheme for (w, b, u, v) is maintained.

4. Experiments

An exemplary, non-limiting HIPAD algorithm according to an embodiment of the invention can be written in C/C++, and use the CBLAS interface and the BLAS implementation provided by the GNU Scientific Library for basic linear algebra routines, and where code for the ADMM phase is based on that from G. Ye, Y. Chen, and X. Xie, “Efficient variable selection in support vector machines via the alternating direction method of multipliers”, AISTAT 2010, 2010, the contents of which are herein incorporated by reference in their entirety. Algorithm according to embodiments of the invention were tested on both synthetic and real data and compared with the LIBSVM disclosed in R. Fan, P. Chen, and C. Lin, “Working set selection using second order information for training support vector machines”, The Journal of Machine Learning Research, 6:1889-1918, 2005, the contents of which are herein incorporated by reference in their entirety, and ADMM-ENSVM. Since LIBSVM does not support ENSVM, it was used to train a standard SVM. An exemplary, non-limiting HIPAD-ENK according to an embodiment of the invention was written in Matlab, and the Mosek QP IPM solver was used for an IPM for KSVM phase according to an embodiment of the invention through the solver's Matlab interface.

The transition condition at the end of a first phase of a HIPAD according to an embodiment of the invention is specified in EQ. (7), with ε_(tol)=10⁻². The stopping criteria for ADMM-ENSVM are as follows:

${\frac{{F^{k + 1} - F^{k}}}{\max\left\{ {1,{F^{k}}} \right\}} \leq \varepsilon_{1}},{{{a - \left( {e - {\overset{\_}{X}\; w} - {y\; b}} \right)}}_{2} \leq \varepsilon_{1}},{{{c - w}}_{2} \leq \varepsilon_{1}},{\frac{{{w^{k + 1} - w}}_{2}}{{w^{k}}_{2}} \leq \varepsilon_{2}},$ where F^(k) is the objective value at the k-th iteration, ε₁=10⁻⁵, and ε₂=10⁻³. 4.1 Synthetic Data

Synthetic data sets were generated from real data sets as follows. For each real data set, the feature space was augmented to a dimensionality of M=10,000 (or 20,000). The original m features occupy the first m entries in the feature vector. The remaining features were generated as independent and identically distributed (i.i.d.) Gaussian. Table 1, shown in FIG. 7, summarizes the attributes of the synthetic data sets. In the table, N₁=the training set size, and N₂=the test set size. The numbers in parentheses are in the augmented feature dimensions. The prediction accuracy of the algorithms did not appear sensitive to λ₂ and c. For simplicity, embodiments set λ₂=1.0 and c=10 for a HIPAD according to an embodiment of the invention and performed cross validation (CV) to select the best values of λ₁ for HIPAD and ADMM-ENSVM. For LIBSVM, c=1. Table 2, shown in FIG. 8, presents a comparison of prediction accuracies of HIPAD, ADMM, and LIBSVM on the synthetic data, with LIBSVM as the baseline for prediction performance. The entries (x/y) in the columns Support Size indicates that x out of y features are the original features. The last column shows the prediction accuracies obtained by LIBSVM on the original real data sets.

This set of data simulates the presence of noisy features. The prediction accuracy of LIBSVM, which is a standard SVM, falls because of the interference of the noise. In contrast, the results show that a HIPAD according to an embodiment of the invention is robust to noisy features in that it can select the most important true features and eliminate the noisy features, producing the same, or even better, prediction accuracies as those on the original data, as shown in the last column in Table 2. A HIPAD according to an embodiment of the invention is also faster than ADMM-ENSVM on these dense data sets.

4.2. Real Data

A HIPAD algorithm according to an embodiment of the invention was tested on nine real data sets which are publicly available. Table 3, shown in FIG. 9, presents the real data sets, where N₁=the training set size, and N₂=the test set size. rcv1 is a collection of manually categorized news wires from Reuters. The original multiple labels have been consolidated into two classes for binary classification. real-sim contains UseNet articles from four discussion groups for simulated auto racing, simulated aviation, real autos, and real aviation. Both rcv1 and real-sim have large feature dimensions but are highly sparse. The rest of the seven data sets are all dense data. rcv1 and real-sim are subsets of the original data sets, from which 500 training instances and 1,000 test instances were randomly sampled. gisette is a handwriting digit recognition system from NIPS 2003 Feature Selection Challenge, from which 500 instances were sub-sampled for training. For testing, the original test set of 1,000 instances was used. duke, leukemia, and colon-cancer are data sets of gene expression profiles for breast cancer, leukemia, and colon cancer respectively. fMRIa, fMRIb, and fMRIc are functional MRI (fMRI) data of brain activities when the subjects are presented with pictures and text paragraphs. This data was compiled and made available by Tom Mitchell's neuroinformatics research group at Carnegie Mellon University's School of Computer Science, http://www.cs.emu.edu/˜tom/. Except for the three fMRI data sets, all the other data sets and their references are available at the LIBSVM website, www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets.

The parameters of HIPAD, ADMM-ENSVM, and LIBSVM were selected through cross validation on the training data. The experiment results are summarized in Table 4, shown in FIG. 10 where the best prediction accuracy for each data set is highlighted in bold. The results show that a HIPAD algorithm according to an embodiment of the invention produced the best overall predication performance. In terms of speed, a HIPAD algorithm according to an embodiment of the invention consistently outperformed ADMM-ENSVM on dense data, which is consistent with the synthetic data results. On the sparse data sets, however, ADMM-ENSVM required less training time than a HIPAD algorithm according to an embodiment of the invention. Note that the OOQP solver in embodiments of a HIPAD is a general QP solver and takes less advantage of the data sparsity and formulation structure than would a customized IPM solver for SVM. The feature support sizes selected by a HIPAD algorithm according to an embodiment of the invention were also competitive or even better than the those selected by ADMM-ENSVM. In most cases, a HIPAD algorithm according to an embodiment of the invention was able to shrink the feature space to below 10% of the original size.

4.3. Simulation for Knowledge Incorporation

Synthetic data was generated to simulate the example presented at the beginning of Domain Knowledge Incorporation section in a high dimensional feature space. Specifically, four groups of multivariate Gaussians K₁, . . . , K₄ were sampled from N(μ₁ ⁺,Σ₁), . . . , N(μ₄ ⁺,Σ₄), N(μ₁ ⁻,Σ₁), . . . , N(μ₄ ⁻,Σ₄) N(μ, Σ) for four disjoint blocks of feature values (x_(K) ₁ , . . . , x_(K) ₄ ). For positive class samples, μ₁ ⁺=2, μ₂ ⁺=0.5, μ₃ ⁺=−0.2, μ₄ ⁺=−1; for negative class samples, μ₁ ⁻=−2, μ₂ ⁻=−0.5, μ₃ ⁻=0.2, μ₄ ⁻=1. All the covariance matrices have 1 on the diagonal and 0.8 everywhere else. The training samples contain blocks K₂ and K₃, while all four blocks are present in the test samples. A random fraction (5%-10%) of the remaining entries in all the samples are generated from a standard Gaussian distribution.

It was challenging to separate the training samples because the values of blocks K₂ and K₃ for the two classes are close to each other. However, blocks K₁ and K₄ in the test samples are well-separated. Hence, if one is given information about these two blocks as general knowledge for an entire population, the resulting SVM classifier would perform much better on the test data. Since the mean values of the distributions from which the entries in K₁ and K₄ are generated are known, the following information about the relationship between the block sample means and class membership can be supplied to the KSVM:

$\begin{matrix} \left. {{\frac{1}{L_{1}}{\sum\limits_{i \in K_{1}}x_{i}}} \geq 4}\Rightarrow \right. & {{x\mspace{14mu}{belongs}\mspace{14mu}{to}\mspace{14mu}{the}\mspace{14mu}{positive}\mspace{14mu}{class}},} \end{matrix}$ $\begin{matrix} \left. {{\frac{1}{L_{4}}{\sum\limits_{i \in K_{4}}x_{i}}} \geq 3}\Rightarrow \right. & {{x\mspace{14mu}{belongs}\mspace{14mu}{to}\mspace{14mu}{the}\mspace{14mu}{negative}\mspace{14mu}{class}},} \end{matrix}$ where L_(j) is the length of the K_(j), j=1, . . . , 4, and the lowercase x_(i) denotes the i-th entry of the sample x. Translating into the notation of (KSVM-P):

${B = \begin{pmatrix} 0 & \underset{K_{1}}{\underset{︸}{- \frac{e^{T}}{L_{1}}}} & 0 & 0 & 0 & 0 \end{pmatrix}},{d = {- 4}},\text{}{D = \begin{pmatrix} 0 & 0 & 0 & 0 & \underset{K_{4}}{\underset{︸}{- \frac{e^{T}}{L_{4}}}} & 0 \end{pmatrix}},{g = {- 3.}}$ The information given here is not precise, in that a sample should belong to the positive (or negative) class only when the corresponding block sample mean well exceeds the distribution mean. This is consistent with real-world situations, where the domain or expert knowledge tends to be conservative and often does not come in exact form.

Two sets of synthetic data were simulated for ENK-SVM as described above, with (N_(train)=200, N_(test)=400, m_(train)=10; 000) for a dataset ksvm-s-10k and (N_(train)=500, N_(test)=1,000, m_(test)=50; 000) for a dataset ksvm-s-50k. The number of features in each of the four blocks (K₁, K₂, K₃, K₄) is 50 for both data sets. The parameters of a HIPAD-ENK (Algorithm 5) according to an embodiment of the invention are set as follows: λ₁=0:1, λ₂=1.ρ₁=ρ₃=50; ρ₂=ρ₄=25 in an ADMM phase according to an embodiment of the invention; c=0.05, ρ₁=ρ₃=5,000, ρ₂=ρ₄=2,500 in an IPM phase according to an embodiment of the invention. For LIBSVM, c=1.0. Experimental results are presented in Table 5, shown in FIG. 11, and FIG. 12. In Table 5, the best prediction accuracy for each data set is highlighted in bold. FIG. 12 depicts plots of various attributes of an ADMM-ENK according to an embodiment of the invention against iterations. The attributes are, from left to right, the number of non-zero features, the number of changes in feature support indices, the objective value, and accuracy. The data set is ksvm-s-10k. A first phase of HIPAD-ENK terminated at iteration 20, when all four blocks of features had been correctly identified and the improvement in accuracy of ADMM-ENK leveled off. A HIPAD-ENK according to an embodiment of the invention is effective in terms of speed, feature selection, and prediction accuracy on these two data sets. FIG. 13 (top) is a plot of the solution w returned by a first phase of a HIPAD-ENK according to an embodiment of the invention on the data set ksvm-s-10k. The non-zero features correspond to blocks K₁ to K₄ in that order. FIG. 13 (bottom) is a plot of the solution w returned by a HIPAD-ENK according to an embodiment of the invention on the data set ksvm-s-10k when d=g=0. The non-zero features correspond to blocks K₂ and K₃. Even though the features in blocks K₁ and K₄ do not discriminate in the training data, a HIPAD-ENK according to an embodiment of the invention was still able to identify all 200 features in the four blocks correctly and exactly, as illustrated in FIG. 13. The expert knowledge not only helps rectify the separating hyperplane so that it generalizes better on the entire population, but also helps the training algorithm realize the significance of the features in blocks K₁ and K₄.

In fact, the feature supports identified by a HIPAD-ENK according to an embodiment of the invention also provides an indication on how useful (or valid) the given information is towards the current classification task. To demonstrate that, d and g above were both changed to 0. This time, a HIPAD-ENK according to an embodiment of the invention did not include K₁ and K₄ in the feature support as shown in FIG. 13. The reason is quite intuitive: the features in K₁ and K₄ of the training data are sparse zero-mean Gaussian noise and have little correlation with the class labels, resulting in a significant number of training samples whose class membership contradicts the supplied knowledge. This shows that an ENK-SVM according to an embodiment of the invention is robust to false information that contradicts the training data.

5. System Implementations

It is to be understood that embodiments of the present invention can be implemented in various forms of hardware, software, firmware, special purpose processes, or a combination thereof. In one embodiment, the present invention can be implemented in software as an application program tangible embodied on a computer readable program storage device. The application program can be uploaded to, and executed by, a machine comprising any suitable architecture.

FIG. 14 is a block diagram of an exemplary computer system for implementing a hybrid interior-point alternating directions algorithm for support vector machines and feature selection according to an embodiment of the invention. Referring now to FIG. 14, a computer system 141 for implementing the present invention can comprise, inter alia, a central processing unit (CPU) 142, a memory 143 and an input/output (I/O) interface 144. The computer system 141 is generally coupled through the I/O interface 144 to a display 145 and various input devices 146 such as a mouse and a keyboard. The support circuits can include circuits such as cache, power supplies, clock circuits, and a communication bus. The memory 143 can include random access memory (RAM), read only memory (ROM), disk drive, tape drive, etc., or a combinations thereof. The present invention can be implemented as a routine 147 that is stored in memory 143 and executed by the CPU 142 to process the signal from the signal source 148. As such, the computer system 141 is a general purpose computer system that becomes a specific purpose computer system when executing the routine 147 of the present invention.

The computer system 141 also includes an operating system and micro instruction code. The various processes and functions described herein can either be part of the micro instruction code or part of the application program (or combination thereof) which is executed via the operating system. In addition, various other peripheral devices can be connected to the computer platform such as an additional data storage device and a printing device.

It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures can be implemented in software, the actual connections between the systems components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.

While the present invention has been described in detail with reference to exemplary embodiments, those skilled in the art will appreciate that various modifications and substitutions can be made thereto without departing from the spirit and scope of the invention as set forth in the appended claims. 

What is claimed is:
 1. A computer-implemented method for training a classifier for selecting features in sparse data sets with high feature dimensionality, the method performed by the computer comprising the steps of: providing a set of N data items x and associated labels y, each data item being a vector in R^(m) where m>>1; minimizing a functional of the data items x and associated labels y ${L\left( {w,b,a,c,\gamma_{1},\gamma_{2}} \right)}:={{\frac{1}{N}{\sum\limits_{i = 1}^{N}a_{i}}} + {\lambda_{1}{c}_{1}} + {\frac{\lambda_{2}}{2}{w}_{2}^{2}} + {\gamma_{1}^{T}\left( {e - {Y\left( {{X\; w} + {b\; e}} \right)} - a} \right)} + {\gamma_{2}^{T}\left( {w - c} \right)} + {\frac{\mu_{1}}{2}{{e - {Y\left( {{X\; w} + {b\; e}} \right)} - a}}_{2}^{2}} + {\frac{\mu_{2}}{2}{{w - c}}_{2}^{2}}}$ formed from a program ${\min\limits_{w,b,a,c}{\frac{1}{N}{\sum\limits_{i = 1}^{N}a_{i}}}} + {\lambda_{1}{c}_{1}} + {\frac{\lambda_{2}}{2}{w}_{2}^{2}}$ subject to $\quad\left\{ \begin{matrix} {{a = {e - {Y\left( {{X\; w} + {b\; e}} \right)}}},} \\ {c = {w.}} \end{matrix} \right.$ to solve for hyperplane w and offset b of a classifier using an alternating direction method of multipliers that successively iteratively approximates w and b, auxiliary variables a and c, and multiplier vectors γ₁ and γ₂, wherein λ₁, λ₂, μ₁, and μ₂ are predetermined constants, e is a unit vector, and X and Y are respective matrix representations of the data items x and their associated labels y; providing non-zero elements of the hyperplane vector w and those components of X and Y that correspond to the non-zero elements of the hyperplane vector w as arguments to a convex quadratic program solver that uses an interior point method to solve for hyperplane vector w and offset b, wherein the hyperplane vector w and offset b define a classifier than can associate each data item x with the correct label y.
 2. The method of claim 1, wherein said alternating direction method of multipliers comprises, at each iteration k: updating (w^(k), b^(k)) by solving ${{\begin{pmatrix} {{\left( {\lambda_{2} + \mu_{2}} \right)I} + {\mu_{1}X^{T}X}} & {\mu_{1}X^{T}e} \\ {\mu_{1}e^{T}X} & {\mu_{1}N} \end{pmatrix}\begin{pmatrix} w^{k + 1} \\ b^{k + 1} \end{pmatrix}} = \begin{pmatrix} {{X^{T}Y\;\gamma_{1}^{k}} - {\mu_{1}X^{T}{Y\left( {a^{k} - e} \right)}} - \gamma_{2}^{k} + {\mu_{2}c^{k}}} \\ {{e^{T}Y\;\gamma_{1}^{k}} - {\mu_{1}e^{T}{Y\left( {a_{k} - e} \right)}}} \end{pmatrix}},$ updating a from $\left. a^{k + 1}\leftarrow{S_{{1/N}\;\mu_{1}}\left( {e + \frac{\gamma_{1}^{k}}{\mu_{1}} - {Y\left( {{X\; w^{k + 1}} + {b^{k + 1}e}} \right)}} \right)} \right.,$ wherein ${S_{\lambda}(\omega)} = \left\{ \begin{matrix} {{\omega - \lambda},} & {{\omega > \lambda},} \\ {0,} & {{0 \leq \omega \leq \lambda},} \\ {\omega,} & {{\omega < 0};} \end{matrix} \right.$ updating c from $\left. c^{k + 1}\leftarrow{T_{\lambda_{1}/\mu_{2}}\left( {\frac{\gamma_{2}^{k}}{\mu_{2}} + w^{k + 1}} \right)} \right.$ wherein T_(λ)(ω)=sgn(ω)max{0,|ω|−λ}; updating γ₁ from γ₁ ^(k+1)←γ₁ ^(k)+ρ₁(e−Y(Xw^(k+1)+b^(k+1)e)−a^(k+1)); and updating γ2 from γ₂ ^(k+1)←γ₂ ^(k)+μ₂(w^(k+1)−c^(k+1)); wherein said updates are repeated until ${\frac{{w^{k + 1} - w^{k}}}{\max\left( {{w^{k}},1} \right)} < ɛ_{tol}},$ wherein ε_(tol) is a predetermined constant.
 3. The method of claim 2, wherein the linear system ${\begin{pmatrix} {{\left( {\lambda_{2} + \mu_{2}} \right)I} + {\mu_{1}X^{T}X}} & {\mu_{1}X^{T}e} \\ {\mu_{1}e^{T}X} & {\mu_{1}N} \end{pmatrix}\begin{pmatrix} w^{k + 1} \\ b^{k + 1} \end{pmatrix}} = \begin{pmatrix} {{X^{T}Y\;\gamma_{1}^{k}} - {\mu_{1}X^{T}{Y\left( {a^{k} - e} \right)}} - \gamma_{2}^{k} + {\mu_{2}c^{k}}} \\ {{e^{T}Y\;\gamma_{1}^{k}} - {\mu_{1}e^{T}{Y\left( {a_{k} - e} \right)}}} \end{pmatrix}$ is solved using a preconditioned conjugate gradient method.
 4. The method of claim 1, wherein said convex quadratic program solver minimizes a quadratic program ½w^(T)w+ce^(T)ξ with respect to w, b, s, and ξ subject to $\quad\left\{ \begin{matrix} {{{{y_{i}\left( {{w^{T}x_{i}} - b} \right)} - s_{i} + \xi_{i}} = 1},} & {{i = 1},\ldots\mspace{14mu},N,} \\ {{s \geq 0},{\xi \geq 0},,} & \; \end{matrix} \right.$ wherein c is a predetermined penalty parameter for ξ, and y_(i) and x_(i) are components of X, Y corresponding to non-zero elements of w.
 5. The method of claim 1, wherein said convex quadratic program solver minimizes a quadratic program ½α^(T)Qα−e^(T)α that is a dual of ½w^(T)w+ce^(T)ξ, wherein ½α^(T)Qα−e^(T)α is minimized with respect to α subject to $\quad\left\{ \begin{matrix} {{{y^{T}\alpha} = 0},} & \; \\ {{0 \leq \alpha_{i} \leq c},} & {{i = 1},\ldots\mspace{14mu},N,} \end{matrix} \right.$ and c is a predetermined penalty parameter for ξ.
 6. A computer-implemented method for training a classifier for selecting features in sparse data sets with high feature dimensionality that incorporates prior domain knowledge, the method performed by the computer comprising the steps of: providing a set of data items x and associated labels y, each data item being a vector in R^(m) where m>>1, a set of positive class constraints Bx≦d

w^(T)x+b≧1, wherein BεR^(k) ¹ ^(×m) and dεR^(k) ¹ , and a set of negative class constraints Dx≦g

w^(T)x+b≦−1, wherein DεR^(k) ² ^(×m) and gεR^(k) ² ; using Farkas lemma to transform the positive class constraints into B^(T)u+w=0, d^(T)u−b+1≦0, u≧0 and negative class constraints into D^(T)v−w=0, g^(T)v+b+1≦0, v≧0; minimizing a functional $L = {{\frac{\lambda_{2}}{2}{w}_{2}^{2}} + {\lambda_{1}{c}_{1}} + {\frac{1}{N}e\;{T(a)}_{+}} + {\frac{\rho_{1}}{2}{{{B^{T}u} + w}}_{2}^{2}} + {\rho_{2}(q)}_{+} + {\frac{\rho_{3}}{2}{{{D^{T}v} - w}}_{2}^{2}} + {\rho_{4}(p)}_{+} + {\gamma_{1}^{T}\left( {e - \left( {{\overset{\_}{X}\; w} + {y\; b}} \right) - a} \right)} + {\frac{\mu_{1}}{2}{{e - \left( {\overset{\_}{X} + {y\; b}} \right) - a}}_{2}^{2}} + {\gamma_{2}^{T}\left( {w - c} \right)} + {\frac{\mu_{2}}{2}{{w - c}}_{2}^{2}} + {\gamma_{3}\left( {{d^{T}u} - b + 1 - q} \right)} + {\frac{\mu_{3}}{2}{{{d^{T}u} - b + 1 - q}}_{2}^{2}} + {\gamma_{4}\left( {{g^{T}v} + b + 1 - p} \right)} + {\frac{\mu_{4}}{2}{{{g^{T}v} + b + 1 - p}}_{2}^{2}}}$ to solve for hyperplane w and offset b of a classifier using an alternating direction method of multipliers that successively iteratively approximates w and b, constraint variables u and v, auxiliary variables a, c, p and q, and Lagrangian multipliers γ₁ ^(k+1), γ₂ ^(k+1), γ₃ ^(k+1), γ₄ ^(k+1), γ₅ ^(k+1), and γ₆ ^(k+1) wherein λ₁, λ₂, ρ₁, ρ₂, ρ₃, ρ₄, μ₁, μ₂, μ₃, and μ₄ are predetermined constants, e is a unit vector, and X and Y are respective matrix representations of the data items x and their associated labels y; and providing non-zero elements of the hyperplane vector w, those components of X and Y that correspond to the non-zero elements of the hyperplane vector w, and η_(u) ⁰≡d^(T)u−b+1,η_(v) ⁰≡g^(T)v+b+1 as arguments to a convex quadratic program solver that uses an interior point method to solve for hyperplane vector w and offset b, wherein the hyperplane vector w and offset b define a classifier than can associate each data item x with the correct label y.
 7. The method of claim 6, wherein updating (w^(k), b^(k)) comprises solving ${\begin{pmatrix} {{\kappa_{1}I} + {\mu_{1}X^{T}X}} & {\mu_{1}X^{T}e} \\ {\mu_{1}e^{T}X} & {{\mu_{1}N} + \kappa_{2}} \end{pmatrix}\begin{pmatrix} w^{k + 1} \\ b^{k + 1} \end{pmatrix}} = \begin{pmatrix} r_{w} \\ r_{b} \end{pmatrix}$ wherein κ₁=λ₂+μ₂+ρ₁+ρ₃, κ₂=μ₃+μ₄, r _(w) =X ^(T) Yγ ₁ ^(k)+μ₁ X ^(T) Y(e−a ^(k))−γ₂ ^(k)+μ₂ c ^(k)+ρ₃ D ^(T) v ^(k)−ρ₁ B ^(T) u ^(k), and r _(b) =e ^(T) Yγ ₁ ^(k)+μ₁ e ^(T) Y(e−a ^(k))+γ₃ ^(k)+μ₃(d ^(T) u ^(k)+1−q ^(k))−γ₄ ^(k)−μ₄(g ^(T) v ^(k)+1−p ^(k)).
 8. The method of claim 7, wherein ${\begin{pmatrix} {{\kappa_{1}I} + {\mu_{1}X^{T}X}} & {\mu_{1}X^{T}e} \\ {\mu_{1}e^{T}X} & {{\mu_{1}N} + \kappa_{2}} \end{pmatrix}\begin{pmatrix} w^{k + 1} \\ b^{k + 1} \end{pmatrix}} = \begin{pmatrix} r_{w} \\ r_{b} \end{pmatrix}$ is solved using a preconditioned conjugate gradient method.
 9. The method of claim 6, wherein updating constraint variables u and v comprises solving (ρ₁ BB ^(T)+μ₃ dd ^(T)+μ₅ I)u ^(k+1/2) =r _(u), wherein r_(u)=−ρ₁Bw^(k+1)+μ₃db^(k+1)+μ₃d(q^(k)−1)−dγ₃ ^(k)+γ₅+μ₅s^(k), and (ρ₃ DD ^(T)+μ₄ gg ^(T)+μ₆ I)v ^(k+1) =r _(v), wherein r_(v)=ρ₃Dw^(k+1)−μ₄gb^(k+1)−gγ₄ ^(k)−μ₄g(1−p^(k))+γ₆+μ₆t^(k), wherein slack variables s and t are updated according to $s^{k + 1} = {{{\max\left( {0,{u^{k + 1} - \frac{\gamma_{5}^{k}}{\mu_{5}}}} \right)}\mspace{14mu}{and}\mspace{14mu} t^{k + 1}} = {{\max\left( {0,{v^{k + 1} - \frac{\gamma_{6}^{k}}{\mu_{6}}}} \right)}.}}$
 10. The method of claim 6, wherein the auxiliary variables a, c, p and q are updated according to ${a^{k + 1} = {s_{\frac{1}{N_{\mu_{1}}}}\left( {e + \frac{\gamma_{1}^{k}}{\mu_{1}} - {Y\left( {{Xw}^{k + 1} + {b^{k + 1}e}} \right)}} \right)}},{c^{k + 1} = {T_{\frac{\lambda_{1}}{\mu_{2}}}\left( {\frac{\gamma_{2}^{k}}{\mu_{2}} + w^{k + 1}} \right)}},{q^{k + 1} = {S_{\frac{\rho_{2}}{\mu_{3}}}\left( {{d^{T}u^{k}} - b^{k + 1} + 1 + \frac{\gamma_{3}^{k}}{\mu_{3}}} \right)}},{p^{k + 1} = {S_{\frac{\rho_{4}}{\mu_{4}}}\left( {{g^{T}v^{k}} - b^{k + 1} + 1 + \frac{\gamma_{4}^{k}}{\mu_{4}}} \right)}},$ wherein ${S_{\lambda}(\omega)} = \left\{ \begin{matrix} {{\omega - \lambda},} & {{\omega > \lambda},} \\ {0,} & {{0 \leq \omega \leq \lambda},} \\ {\omega,} & {{\omega < 0};} \end{matrix} \right.$ and T_(λ)(ω)=sgn(ω)max {0,|ω|−λ}.
 11. The method of claim 6, wherein the Lagrangian multipliers γ₁ ^(k+1), γ₂ ^(k+1), γ₃ ^(k+1), γ₄ ^(k+1), γ₅ ^(k+1), and γ₆ ^(k+1) are updated according to γ₁ ^(K+1)←γ₁ ^(K)+μ₁(e−Y(Xw ^(k+1) +b ^(K+1) e)−a ^(k+1)), γ₂ ^(K+1)←γ₂ ^(K)μ₂(w ^(k+1) −c ^(k+1)), γ₃ ^(K+1)←γ₃ ^(K)μ₃(d ^(T) u ^(k+1) −b ^(k+1) −q ^(k+1)+1), γ₄ ^(K+1)←γ₄ ^(K)μ₄(g ^(T) v ^(k+1) +b ^(k+1) −p ^(k+1)+1), γ₅ ^(K+1)←γ₅ ^(K)μ₅(s ^(k+1) −u ^(k+1)), and γ₆ ^(K+1)←γ₆ ^(K)μ₆(t ^(k+1) −v ^(k+1)).
 12. The method of claim 6, wherein said convex quadratic program solver minimizes a quadratic program ½β^(T)Hβ+f^(T)β with respect to β subject to $\quad\left\{ \begin{matrix} {{a_{l} \leq {A\;\beta} \leq a_{u}},} \\ {{x_{l} \leq \beta \leq x_{u}},} \end{matrix} \right.$ wherein β=(w, b, ξ, u, v, η_(u), η_(v))εR^(M), M=m+n+k₁+k₂+3, f=(0,0,ce ^(T),0,0,ρ₂,ρ₄), $H = \begin{pmatrix} {\left( {1 + \rho_{1} + \rho_{3}} \right)I_{m}} & 0 & {\rho_{1}B^{T}} & {{- \rho_{3}}D^{T}} & 0 \\ \; & 0 & \; & \; & \; \\ {\rho_{1}B} & 0 & {\rho_{1}{BB}^{T}} & \; & 0 \\ {{- \rho_{3}}D} & 0 & \; & {\rho_{3}{DD}^{T}} & 0 \\ \; & 0 & \; & \; & \; \end{pmatrix}$ ${A = \begin{pmatrix} \overset{\_}{X} & y & I_{n} & 0 & 0 & 0 & 0 \\ 0 & {- 1} & 0 & d^{T} & 0 & {- 1} & 0 \\ 0 & 1 & 0 & 0 & g^{T} & 0 & {- 1} \end{pmatrix}},{a_{l} = \begin{pmatrix} e \\ 0 \\ 0 \end{pmatrix}},{a_{u} = \begin{pmatrix} 0 \\ {- 1} \\ {- 1} \end{pmatrix}},$ and the values of x_(l), x_(u)=±∞, respectively, for w, and x_(l)=0, x_(u)=∞ for the other components of β.
 13. A non-transitory program storage device readable by a computer, tangibly embodying a program of instructions executed by the computer to perform the method steps for training a classifier for selecting features in sparse data sets with high feature dimensionality that incorporates prior domain knowledge, the method comprising the steps of: providing a set of data items x and associated labels y, each data item being a vector in R^(m) where m>>1, a set of positive class constraints Bx≦d

w^(T)x+b≧1, wherein BεR^(k) ¹ ^(×m) and dεR^(k) ¹ , and a set of negative class constraints Dx≦g

w^(T)x+b≦−1, wherein DεR^(k) ² ^(×m) and gεR^(k) ² ; using Farkas lemma to transform the positive class constraints into B^(T)u+w=0, d^(T)u−b+1≦0, u≧0 and negative class constraints into D^(T)v−w=0, g^(T)v+b+1≦0, v≧0; minimizing a functional $L = {{\frac{\lambda_{2}}{2}{w}_{2}^{2}} + {\lambda_{1}{c}_{1}} + {\frac{1}{N}{{eT}(a)}_{+}} + {\frac{\rho_{1}}{2}{{{B^{T}u} + w}}_{2}^{2}} + {\rho_{2}(q)}_{+} + {\frac{\rho_{3}}{2}{{{D^{T}v} - w}}_{2}^{2}} + {\rho_{4}(p)}_{+} + {\gamma_{1}^{T}\left( {e - \left( {{\overset{\_}{X}}_{w} + {yb}} \right) - a} \right)} + {\frac{\mu_{1}}{2}{{e - \left( {\overset{\_}{X} + {yb}} \right) - a}}_{2}^{2}} + {\gamma_{2}^{T}\left( {w - c} \right)} + {\frac{\mu_{2}}{2}{{w - c}}_{2}^{2}} + {\gamma_{3}\left( {{d^{T}u} - b + 1 - q} \right)} + {\frac{\mu_{3}}{2}{{{d^{T}u} - b + 1 - q}}_{2}^{2}} + {\gamma_{4}\left( {{g^{T}v} + b + 1 - p} \right)} + {\frac{\mu_{4}}{2}{{{g^{T}v} + b + 1 - p}}_{2}^{2}}}$ to solve for hyperplane w and offset b of a classifier using an alternating direction method of multipliers that successively iteratively approximates w and b, constraint variables u and v, auxiliary variables a, c, p and q, and Lagrangian multipliers γ₁ ^(k+1), γ₂ ^(k+1), γ₃ ^(k+1), γ₄ ^(k+1), γ₅ ^(k+1), and γ₆ ^(k+1) wherein λ₁, λ₂, ρ₁, ρ₂, ρ₃, ρ₄, μ₁, μ₂, μ₃, and μ₄ are predetermined constants, e is a unit vector, and X and Y are respective matrix representations of the data items x and their associated labels y; and providing non-zero elements of the hyperplane vector w, those components of X and Y that correspond to the non-zero elements of the hyperplane vector w, and η_(u) ⁰≡d^(T)u−b+1,η_(v) ⁰≡g^(T)v+b+1 as arguments to a convex quadratic program solver that uses an interior point method to solve for hyperplane vector w and offset b, wherein the hyperplane vector w and offset b define a classifier than can associate each data item x with the correct label y.
 14. The computer readable program storage device of claim 13, wherein updating (w^(k), b^(k)) comprises solving ${\begin{pmatrix} {{\kappa_{1}I} + {\mu_{1}X^{T}X}} & {\mu_{1}X^{T}e} \\ {\mu_{1}e^{T}X} & {{\mu_{1}N} + \kappa_{2}} \end{pmatrix}\begin{pmatrix} w^{k + 1} \\ b^{k + 1} \end{pmatrix}} = \begin{pmatrix} r_{w} \\ r_{b} \end{pmatrix}$ wherein κ₁=λ₂+μ₂+ρ₁+ρ₃, κ₂=μ₃+μ₄, r _(w) =X ^(T) Yγ ₁ ^(k)+μ₁ X ^(T) Y(e−a ^(k))−γ₂ ^(k)+μ₂ c ^(k)+ρ₃ D ^(T) v ^(k)−ρ₁ B ^(T) u ^(k), and r _(b) =e ^(T) Yγ ₁ ^(k)+μ₁ e ^(T) Y(e−a ^(k))+γ₃ ^(k)+μ₃(d ^(T) u ^(k)+1−q ^(k))−γ₄ ^(k)−μ₄(g ^(T) v ^(k)+1−p ^(k)).
 15. The computer readable program storage device of claim 14, wherein ${\begin{pmatrix} {{\kappa_{1}I} + {\mu_{1}X^{T}X}} & {\mu_{1}X^{T}e} \\ {\mu_{1}e^{T}X} & {{\mu_{1}N} + \kappa_{2}} \end{pmatrix}\begin{pmatrix} w^{k + 1} \\ b^{k + 1} \end{pmatrix}} = \begin{pmatrix} r_{w} \\ r_{b} \end{pmatrix}$ is solved using a preconditioned conjugate gradient method.
 16. The computer readable program storage device of claim 13, wherein updating constraint variables u and v comprises solving (ρ₁ BB ^(T)+μ₃ dd ^(T)+μ₅ I)u ^(k+1/2) =r _(u), wherein r _(u)=−ρ₁ Bw ^(k+1)+μ₃ db ^(k+1)+μ₃ d(q ^(k)−1)−dγ ₃ ^(k)+γ₅+μ₅ s ^(k), and (ρ₃ DD ^(T)+μ₄ gg ^(T)+μ₆ I)v ^(k+1) =r _(v), wherein r _(v)=ρ₃ Dw ^(k+1)−μ₄ gb ^(k+1)−μ₄ g(1−p ^(k))+γ₆+μ₆ t ^(k), wherein slack variables s and t are updated according to $s^{k + 1} = {{{\max\left( {0,{u^{k + 1} - \frac{\gamma_{5}^{k}}{\mu_{5}}}} \right)}\mspace{14mu}{and}\mspace{14mu} t^{k + 1}} = {{\max\left( {0,{v^{k + 1} - \frac{\gamma_{6}^{k}}{\mu_{6}}}} \right)}.}}$
 17. The computer readable program storage device of claim 13, wherein the auxiliary variables a, c, p and q are updated according to ${a^{k + 1} = {s_{\frac{1}{N_{\mu_{1}}}}\left( {e + \frac{\gamma_{1}^{k}}{\mu_{1}} - {Y\left( {{Xw}^{k + 1} + {b^{k + 1}e}} \right)}} \right)}},{c^{k + 1} = {T_{\frac{\lambda_{1}}{\mu_{2}}}\left( {\frac{\gamma_{2}^{k}}{\mu_{2}} + w^{k + 1}} \right)}},{q^{k + 1} = {S_{\frac{\rho_{2}}{\mu_{3}}}\left( {{d^{T}u^{k}} - b^{k + 1} + 1 + \frac{\gamma_{3}^{k}}{\mu_{3}}} \right)}},{p^{k + 1} = {S_{\frac{\rho_{4}}{\mu_{4}}}\left( {{g^{T}v^{k}} - b^{k + 1} + 1 + \frac{\gamma_{4}^{k}}{\mu_{4}}} \right)}},$ wherein ${S_{\lambda}(\omega)} = \left\{ \begin{matrix} {{\omega - \lambda},} & {{\omega > \lambda},} \\ {0,} & {{0 \leq \omega \leq \lambda},} \\ {\omega,} & {{\omega < 0};} \end{matrix} \right.$ and T_(λ)(ω)=sgn(ω)max {0,|ω|−λ}.
 18. The computer readable program storage device of claim 13, wherein the Lagrangian multipliers γ₁ ^(k+1), γ₂ ^(k+1), γ₃ ^(k+1), γ₄ ^(k+1), γ₅ ^(k+1), and γ₆ ^(k+1) are updated according to γ₁ ^(K+1)←γ₁ ^(K)+μ₁(e−Y(Xw ^(k+1) +b ^(k+1) e)−a ^(k+1)), γ₂ ^(K+1)←γ₂ ^(K)+μ₂(w ^(k+1) −c ^(k+1)), γ₃ ^(K+1)←γ₃ ^(K)+μ₃(d ^(T) u ^(k+1) −b ^(k+1) −q ^(k+1)+1), γ₄ ^(K+1)←γ₄ ^(K)+μ₄(g ^(T) v ^(k+1) +b ^(k+1) −p ^(k+1)+1), γ₅ ^(K+1)←γ₅ ^(K)+μ₅(s ^(k+1) −u ^(k+1)), γ₆ ^(K+1)←γ₆ ^(K)+μ₆(t ^(k+1) −v ^(k+1)).
 19. The computer readable program storage device of claim 13, wherein said convex quadratic program solver minimizes a quadratic program ½β^(T)Hβ+f^(T)β with respect to β subject to $\quad\left\{ \begin{matrix} {{a_{l} \leq {A\;\beta} \leq a_{u}},} \\ {{x_{l} \leq \beta \leq x_{u}},} \end{matrix} \right.$ wherein β=(w, b, ξ, u, v, η_(u), η_(v))εR^(M), M=m+n+k₁+k₂+3, f=(0,0,ce ^(T),0,0,ρ₂,ρ₄), $H = \begin{pmatrix} {\left( {1 + \rho_{1} + \rho_{3}} \right)I_{m}} & 0 & {\rho_{1}B^{T}} & {{- \rho_{3}}D^{T}} & 0 \\ \; & 0 & \; & \; & \; \\ {\rho_{1}B} & 0 & {\rho_{1}{BB}^{T}} & \; & 0 \\ {{- \rho_{3}}D} & 0 & \; & {\rho_{3}{DD}^{T}} & 0 \\ \; & 0 & \; & \; & \; \end{pmatrix}$ ${A = \begin{pmatrix} \overset{\_}{X} & y & I_{n} & 0 & 0 & 0 & 0 \\ 0 & {- 1} & 0 & d^{T} & 0 & {- 1} & 0 \\ 0 & 1 & 0 & 0 & g^{T} & 0 & {- 1} \end{pmatrix}},{a_{l} = \begin{pmatrix} e \\ 0 \\ 0 \end{pmatrix}},{a_{u} = \begin{pmatrix} 0 \\ {- 1} \\ {- 1} \end{pmatrix}},$ and the values of x_(l), x_(u)=±∞, respectively, for w, and x_(l)=0, x_(u)=∞ for the other components of β. 